Introduction

Deep learning is often framed as an optimization problem: define a neural network, choose a loss function, and minimize it with stochastic gradient descent. Yet beneath this procedural view lies a statistical one. Each time we minimize cross-entropy, add weight decay, or compare models using validation performance, we are implicitly invoking ideas related to Bayesian inference: uncertainty, evidence, and generalization.

This post develops a Bayesian perspective on deep learning. Rather than viewing training purely as loss minimization, we interpret it as approximate inference in a probabilistic model. This shift in viewpoint helps clarify several familiar concepts:

the role of priors and regularization,
why maximum likelihood and MAP estimation arise,
how model evidence balances data fit and complexity,
and why flat minima and highly overparameterized models can still generalize.

We will start from Bayes' rule over models and progressively move toward parametric formulations used in modern machine learning. Along the way, we will see how common approximations arise naturally from this perspective and how they relate to the geometry of the loss landscape.

A key insight will emerge: generalization is not governed simply by the number of parameters, but by the effective dimensionality of the parameter space, i.e., the number of directions that the data actually constrain.

The goal of this article is not to claim that deep learning is fully Bayesian in practice. Rather, it is to show that Bayesian reasoning provides a useful conceptual framework for understanding phenomena that might otherwise appear puzzling, such as the remarkable generalization ability of modern heavily overparameterized neural networks.

A Quick Note on Notations

Before tackling the main subject, we first need to define some notations. We will encounter two types of objects: beliefs and noise. These two objects are both represented as random variables, but their roles are very different. Beliefs describe our often imperfect knowledge of the truth. Noise limits the amount of information that an observation brings.

Random variables are denoted with capital letters (e.g. $Q$ , $A$ , $\Theta$ ) and their value with lowercase letters (e.g. $q$ , $a$ , $\theta$ ). We will also use the following abbreviations when there is no ambiguity:

Abbreviation	Full Notation
$p(x)$ / $P(x)$	$f_X(x)$ / $P(X=x)$
$p(x \vert y) / P(x \vert y)$	$f_{X\vert Y}(x\vert y)$ / $P(X=x \vert Y=y)$
$p(x, y)$ / $P(x, y)$	$f_{X, Y}(x, y)$ / $P(X=x, Y=y)$
$p_\theta(x)$ / $P_\theta(x)$	$f_{X\vert\Theta}(x \vert \theta)$ / $P(X=x \vert \Theta = \theta)$
$p_\theta(x \vert y)$ / $P_\theta(x \vert y)$	$f_{X\vert Y, \Theta}(x \vert y, \theta)$ / $P(X=x \vert Y=y, \Theta = \theta)$

The Probabilistic Foundation

Generally, problems are questions with unknown answers, for example, what will the weather be tomorrow, or is there a cat in this image? Machine learning promises to provide a general methodology to answer such questions, leveraging available information. This information is often in the form of a dataset of already answered questions. What we want is to answer new questions, or at least get the probability of possible answers.

To abstract this problem, we will note questions $x$ and answers $y$ and consider the simple case where the dataset $\mathcal{D}$ contains pairs of question and answer $(x, y)$ . Formally, we are thus interested in estimating:

P(y|x, \mathcal{D})

The probability of answer $y$ to the question $x$ given the available information in the dataset $\mathcal{D}$ for all possible $y$ .

For a question $x$ in the dataset, we can estimate $P(y|x, \mathcal{D})$ by counting the frequency of tuples $(x, y)$ in $\mathcal{D}$ . However, for a new question, there is no way to use existing data to generalize without introducing some assumptions. Indeed, without further assumption, nothing prevents answers from being totally random; the structure we may have noticed in our dataset can just be a very (un)lucky coincidence.

Introducing Models

At this point, we introduce models. Models are central in machine learning. They are candidates for how things work. Often, we use statistical models. Statistical models differ from deterministic models in a subtle but very important way: they are uncertain. This changes everything. Indeed, there is a common aphorism that states:

"All models are wrong, but some are useful."

It's not entirely true. If you say that you think temperatures tomorrow will probably be between -10°C and 40°C, you have a mental model that captures part of reality; your model of how the world works is not wrong, probably a bit imprecise, and certainly not very useful, but not wrong. And when expressing uncertainty, you don't say that the temperature tomorrow is determined by fundamentally random stuff, just that you are not able to predict it precisely. It is crucial to distinguish probabilities that describe a belief from those that describe noise, even if both are described by the same mathematical tools.

Models are generally diverse:

Some are simple, others complex.
Some explain the data well, while others do not.
Some make sharp predictions, whereas others only make broad ones.

A model is useful if it agrees with the data and makes sharp predictions which are correct. "Simple" models are often better than complex ones, but it is a story for another time.

The usefulness of models lies in their universality. When using models, we implicitly assume that the dataset and future data follow the same underlying laws. This is what allows us to generalize from available data to new data. Models are what create a connection between what we experience and the future.

Models make predictions: If the model (noted $M$ ) is true, then the answer $y$ to the question $x$ has the probability distribution $P(y|x, M)$ .

Of course, we generally don't know if a model is true. At best, we can use our experience or intuition to evaluate the plausibility of models. If we define a distribution over models representing our belief in different models from a set of candidates, the probability of an answer given a question and data becomes:

P(y|x, \mathcal{D}) = \sum_{M \in \mathcal{M}} P(y|x, M) P(M | \mathcal{D})

Although the set of all possible models is unrestricted, in practice, we work with a set of models that is rich enough to approximate any relevant one while excluding models that are implausible a priori. For example, linear regression restricts attention to linear relationships between $x$ and $y$ . Nevertheless, it can approximate more complex relationships.

In practice, we cannot evaluate all possible models and compute their likelihood. We have to rely on approximations of this sum. A common approximation consists of searching for one or a few probable models and computing the weighted average of their predictions. This approach is called "model averaging", and it is usually effective.

The prohibitive cost of averaging many models leads to the idea of posterior maximization (MAP). The principle is to search for one or a few models $M^*$ which are the most likely after observing the data:

M^* = \arg\max_{M \in \mathcal{M}} P(M|\mathcal{D})

The likelihood of models can be computed with Bayes' rule :

P(M|\mathcal{D}) = \frac{P(\mathcal{D}|M)P(M)}{P(\mathcal{D})}

The term $P(M)$ is the a priori probability of models (our guess of model probability before observing the data from $\mathcal{D}$ ) it is called the prior
The term $P(\mathcal{D}|M)$ is the likelihood of the dataset if the model $M$ is true.
The term $P(\mathcal{D})$ is a constant normalization term.

Prior over models

The prior is chosen freely by the practitioner and can theoretically be selected to get any result. It is sometimes argued that we should choose $P(M)$ following the maximum entropy principle. For example, choosing a uniform distribution over models.

However, the entropy is related to the set of models you have and how you parameterize them. For example, you could have a thousand similar models and one radically different one, or just two different models. The maximum likelihood principle will yield very different predictions. Supposing that the likelihood of observed data is the same for all models. When we make predictions, we average the models' predictions; the first case will produce a very different prediction from the second one, as in the first case, the average prediction will be dominated by the thousand similar models, while in the second case, the two different predictions will be weighted equally.

When models live in a continuous space, the maximum entropy principle is not well defined, as the entropy of a continuous distribution is not invariant to reparameterization.

So, defining a theoretically "neutral" prior is challenging; the choice of the prior is a crucial part of the modeling process and can have a significant impact on the results. However, as we will see in the next section, as more data becomes available, the influence of the prior diminishes, and the likelihood of the data given the model becomes the dominant factor in determining which models are probable.

Data vs Prior

If there is no "neutral" prior, how can we choose a reasonable one? Basic advice is to try not to introduce obviously bad biases and avoid too complex models. If you follow this advice, you should be fine, but keep in mind that the prior is important and can cause issues. However, as the dataset gets larger, the prior term becomes less important.

Indeed, consider two models $A$ & $B$ for which if $(x, y)$ is sampled from the data generating process (DGP), we have:

\left\{ \begin{align*} \mathbb{E}[\log P(y|x, A)] &= \log P_A \\ \mathbb{E}[\log P(y|x, B)] &= \log P_B \end{align*} \right.

i.e., the expected log likelihood of a sample from the DGP is $\log P_A$ for model $A$ and $\log P_B$ for model $B$ on average. If $P_A > P_B$ , we can say that model $A$ fits the data better than model $B$ . If the dataset $\mathcal{D}$ is composed of $n$ i.i.d. samples, the term $P(\mathcal{D} | M)$ can be expressed as a product over dataset samples, thus:

\begin{align*} P(M | \mathcal{D}) &\propto P(M) \prod_{(x, y)\in\mathcal{D}} P(y|x, M) \end{align*}

Taking the logarithm of the ratio of model probabilities divided by $n$ : $\frac{1}{n}\log{\frac{P(A|\mathcal{D})}{P(B|\mathcal{D})}}$ , we have that the expected value of this term is:

\begin{align*} \mathbb{E}\left[\frac{1}{n} \log{\frac{P(A|\mathcal{D})}{P(B|\mathcal{D})}}\right] &= \underbrace{\frac{1}{n} \log{\frac{P(A)}{P(B)}}}_{\rightarrow 0} + (\log P_A - \log P_B) \\ &\underset{n \rightarrow \infty}{\longrightarrow} (\log P_A - \log P_B) \end{align*}

And its variance is:

\begin{align*} \text{Var}\left(\frac{1}{n}\log{\frac{P(A|\mathcal{D})}{P(B|\mathcal{D})}}\right) &= \mathcal{O}\left(\frac{1}{n}\right) \\ &\underset{n \rightarrow \infty}{\longrightarrow} 0 \end{align*}

Thus, it converges in probability to $\log P_A - \log P_B$ :

\begin{align*} \frac{1}{n} \log{\frac{P(A|\mathcal{D})}{P(B|\mathcal{D})}} &\overset{p} {\underset{n \rightarrow \infty}{\longrightarrow}} (\log P_A- \log P_B) \\ \end{align*}

We can conclude that:

\frac{P(A|\mathcal{D})}{P(B|\mathcal{D})} \underset{n \rightarrow \infty} {\approx} \exp{n (\log P_A-\log P_B)}

Thus, when the prior is fixed, the weight of the model that best fits the data dominates the weight of the other model exponentially fast as the dataset size increases. The prior influence diminishes as more data is available. The choice of a good prior is thus less important when a lot of data is available.

However, this isn't the full story. Indeed, for finite datasets, it is not uncommon to have several different models fit the data equally well (even for large datasets). For example, when models are over-parametrized. In this case, the data likelihood $P(\mathcal{D}|M)$ is the same for all these models. In this case, only the prior influences the relative weights of selected models. In a sense, data filters out bad models, and the prior weights the remaining ones.

Parametric models

One way to define the set of possible models is the use of a parametric model. Formally we define the parametric model $\mathcal{M}$ as a set of models defined by a function $F$ that takes as input parameters $\theta \in \Theta$ and outputs a model $p_\theta$ :

\mathcal{M} = \left\{ p_\theta | \theta \in \Theta \right\}

\begin{align*} p\colon \Theta \times \mathcal{X} \times \mathcal{Y} &\longrightarrow \mathbb{R}^+ \\ (\theta, x, y) &\longmapsto p_\theta(y|x) \end{align*}

Do not get confused by the terminology "parametric model". The term "model" is used here both to refer to the function $p$ that defines the family of models or to the set of models $\mathcal{M}$ itself. In this post, we will use the term "parametric model" to refer to the set of models $\mathcal{M}$ defined by the function $p$ , and we will call $p$ the "model function". The model function is a way to define a family of models whose behavior can be tuned by changing its parameters (also called weights), noted $\theta$ .

Using parametric models allows us to substitute the search for good models with the search for good parameters, as we now have a mapping between models and parameters. We switch to the continuous Bayes rule (note that here $p$ denotes a density, not the "model function"):

p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}

With the likelihood of the answer $y$ given the question $x$ and data $\mathcal{D}$ :

\begin{align*} p(y|x, \mathcal{D}) &= \int p(y|x, \theta) p(\theta|\mathcal{D}) d\theta \end{align*}

Again, as the sum over models, this integral is generally intractable, and we need to rely on approximations. One common approximation is to search for the most likely parameters $\theta^*$ and use the model $p_{\theta^*}$ as an approximation of the true model:

\begin{align*} \theta^* &= \arg\max_{\theta \in \Theta} p(\theta|\mathcal{D}) \\ &= \arg\max_{\theta \in \Theta} p(\mathcal{D}|\theta)p(\theta) \end{align*}

Note:

This is called maximum a posteriori (MAP) estimation. If we assume a uniform prior over parameters under our chosen parameterization, this reduces to maximum likelihood estimation (MLE):

\begin{align*} \theta^* &= \arg\max_{\theta \in \Theta} p(\mathcal{D}|\theta) \end{align*}

Going beyond MAP

Often, Gaussian priors are used for parameters, leading to L2 regularization. This is a common regularization technique in deep learning that encourages the parameters to have a small norm. The rationale behind this prior is that a smaller parameter norm is related to model simplicity and thus better expected generalization. This is a good example of how the choice of the prior can have a significant impact on the results.

However, it is often overlooked that choosing the parameters that maximize the posterior is not the only way to approximate the true model. Indeed, maximizing the posterior is just one way to approximate the integral over parameters. Choosing the prediction from the model with the highest weight in the sum over models in the discrete case is quite a bold move, as this model can represent a very small fraction of the total weight of the sum over models.

In the continuous case, the situation is even worse. How can we justify approximating the full integral with a single point estimate with effectively zero probability mass? The only justification is that we can expect some regularity such that models that are close in parameter space make similar predictions.

What we can do to go beyond MAP is to consider a set $\hat{\Theta}$ of nearly equivalent models close to the one we found and compute its posterior probability mass. The problem become equivalent to finding the set $\hat{\Theta}$ that maximize:

\begin{align*} P(\theta \in \hat{\Theta} | \mathcal{D}) &= \frac{p(\mathcal{D} | \theta \in \hat{\Theta})P(\theta \in \hat{\Theta})}{p(\mathcal{D})} \\ &\propto p(\mathcal{D} | \theta \in \hat{\Theta})P(\theta \in \hat{\Theta}) \end{align*}

Equivalently, we can minimize:

\begin{align*} \mathcal{L}(\theta) &= -\log p(\mathcal{D} | \theta \in \hat{\Theta}) -\log P(\theta \in \hat{\Theta}) \end{align*}

In itself, this expression is already interesting. Intuitively, the first term corresponds to the likelihood of the data given our model; it's a term that accounts for how well we fit the data. The interpretation of the second term is more subtle. When considering a point-like ensemble of models, it is proportional to $-\log p(\theta^)$ and can be thought of as the likelihood of the model given our prior. However, now we have the negative log of a probability of an event ( $\theta \in \hat{\Theta}$ ) also known as the information content of the event (i.e. the number of bits needed to describe this event). It is thus the minimum description length (MDL) of $\theta \in \hat{\Theta}$ .

This provides a theoretical justification for the MDL (Minimum Description Length) principle, which states that when two models are equally good at explaining the data, the one with the shortest description should be chosen. This connection is classical and described in (MacKay, 2003, p. 349) as:

The log of the Occam factor can be interpreted as the amount of information we gain about the model when the data arrive.

However I think it's worth to make explicit the idea that this amount of information is directly the number of bits needed to describe the model and thus is directly related to the MDL principle. This principle is very powerful as it encapsulates ideas about the precision of model parameters and the simplicity of programs.

Intuition: Why posterior volume matters

Before diving into the derivation, it is useful to understand the intuition behind what we are trying to measure. From a Bayesian perspective, selecting a model does not only depend on how well a single parameter configuration fits the data. What matters is also how much parameter space produces similarly good predictions.

Consider two models that achieve the same training loss. The first requires parameters to be tuned very precisely: small perturbations cause the loss to increase sharply. The second admits a wide region of parameters that all perform nearly as well. Even though both models fit the data equally well at their optimum, models of the second kind occupy a much larger volume in parameter space, and thus their predictions have a larger posterior probability mass.

The goal of the following derivation is therefore to estimate the posterior mass contained in a small region around a minimum of the loss.

Let's now try to get an expression for this Loss. The term $-\log p(\mathcal{D} | \theta \in \hat{\Theta})$ can be easily approximated by considering the expected negative log likelihood:

L(\theta) = \mathbb{E}_{(x, y) \sim DGP}\left[-\log p(y|x, \theta)\right]

given a dataset $\mathcal{D}$ of size $n$ . We can approximate the likelihood of the dataset given the parameters $\theta$ by:

\begin{align*} p(\mathcal{D}|\theta) &= \prod_{(x, y)\in\mathcal{D}} p(y|x, \theta) = \exp{\sum_{(x, y)\in\mathcal{D}} \log p(y|x, \theta)} \\ &\approx \exp\left(-n \mathbb{E}_{(x, y)\sim \mathcal{D}}\left[-\log p(y|x, \theta)\right]\right) \\ &= \exp\left(-n L(\theta)\right) \end{align*}

This means that as the dataset size increases, the likelihood of the dataset given the parameters $\theta$ becomes more peaked around the parameters that minimize the expected negative log likelihood $L(\theta)$ .

For the second term, we can consider a natural set of models verifying these two conditions:

$\Vert\theta - \theta^*\Vert \leq \epsilon_\theta$ : i.e. parameters are sufficiently close to $\theta^*$ for the predictions to be similar (or for the predictions to average to the predictions of $\theta^*$ )
$nL(\theta) - nL(\theta^*) \leq \epsilon_L$ : parameters fit data reasonably well.

Approximating the second term is more involved, so we will jump directly to the result. However, you can find the derivation here:

Approximating the loss function using the second order Taylor expansion around $\theta^*$ the second condition becomes:

\begin{align*} nL(\theta) - nL(\theta^*) &\leq \epsilon_L \\ \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*) &\leq \frac{\epsilon_L}{n} \\ \frac{1}{2}\Vert \theta - \theta^*\Vert_{H}^2 &\leq \frac{\epsilon_L}{n} \\ \Vert \theta - \theta^*\Vert_{H} &\leq \sqrt{\frac{2\epsilon_L}{n}} \end{align*}

The probability mass spanned by the $\theta$ that satisfy both of these conditions is the intersection of a ball and an ellipsoid, for which computation will give no closed-form formula. Nevertheless, we can approximate these hard constraints with soft Gaussian windows to get a closed-form approximation of the probability mass of this region.

$\Vert \theta - \theta^*\Vert \leq \epsilon_\theta \longrightarrow \psi_{\theta^*, \epsilon_\theta^2}(\theta) = \exp\left(-\frac{\Vert \theta - \theta^*\Vert^2}{2\epsilon_\theta^2}\right)$
$\mathcal{L}(\theta) - \mathcal{L}(\theta^*) \leq \epsilon_L \longrightarrow \psi_{\theta^*, nH / 2\epsilon_L}(\theta) = \exp\left(-\frac{(\theta - \theta^*)^\top nH (\theta - \theta^*)}{4\epsilon_L}\right)$

We can then compute the probability mass of $\hat{\Theta}$ :

\begin{align*} P(\theta \in \hat{\Theta}) &= \int p(\theta) \psi_{\theta^*, \epsilon_\theta^2}(\theta) \psi_{\theta^*, nH / 2\epsilon_L}(\theta) d\theta \\ &= \int \frac{1}{\sqrt{2\pi\sigma^2}^{d}}\exp\left(-\frac{\Vert\theta\Vert^2}{2\sigma^2}\right) \exp\left(-\frac{(\theta - \theta^*)^\top nH (\theta - \theta^*)}{4\epsilon_L}\right) \exp\left(-\frac{\Vert \theta - \theta^*\Vert^2}{2\epsilon_\theta^2}\right) d\theta \\ &= \frac{1}{\sqrt{2\pi\sigma^2}^{d}} \int \exp\left(-\frac{\Vert\theta - \theta^* + \theta^*\Vert^2}{2\sigma^2}\right) \exp\left(-\frac{(\theta - \theta^*)^\top nH (\theta - \theta^*)}{4\epsilon_L}\right) \exp\left(-\frac{\Vert \theta - \theta^*\Vert^2}{2\epsilon_\theta^2}\right) d\theta \\ &= \frac{1}{\sqrt{2\pi\sigma^2}^{d}} \int \exp\left(-\frac{\Vert\theta^*\Vert^2}{2\sigma^2}\right) \exp\left(-\frac{\theta^{*\top}(\theta - \theta^*)}{\sigma^2}\right) \exp\left(-\frac{1}{2}(\theta - \theta^*)^\top \left(\frac{I}{\sigma^2} + \frac{nH}{2\epsilon_L} + \frac{I}{\epsilon_\theta} \right) (\theta - \theta^*)\right) d\theta \end{align*}

Let's introduce the notation $A = \frac{I}{\sigma^2} + \frac{nH}{2\epsilon_L} + \frac{I}{\epsilon_\theta}$

\begin{align*} P(\theta \in \hat{\Theta}) &= \frac{1}{\sqrt{2\pi\sigma^2}^{d}} \exp\left(-\frac{\Vert\theta^*\Vert^2}{2\sigma^2}\right) \int \exp\left(-\frac{1}{2}(\theta - \theta^*)^\top A (\theta - \theta^*) -\frac{\theta^{*\top}(\theta - \theta^*)}{\sigma^2}\right) d\theta \\ &= \sigma^{-d} \left\vert A \right\vert^{-\frac{1}{2}} \exp \left(-\frac{1}{2\sigma^2} \theta^{*\top} \left(I + \frac{A^{-1}}{\sigma^2}\right) \theta^* \right) \end{align*}

And taking the negative log to get the information content of $\hat{\Theta}$ (i.e. the number of bits needed to describe the random event $\theta \in \hat{\Theta}$ ):

-\log(P(\theta \in \hat{\Theta})) = d \log{\sigma} + \frac{1}{2} \sum_{i=1}^d \log \left(\frac{n\lambda_i}{2\epsilon_L} + \frac{1}{\sigma^2} + \frac{1}{\epsilon_\theta^2} \right) + \frac{1}{2\sigma^2}\Vert \theta^* \Vert_{I + \frac{A^{-1}}{\sigma^2}}^2

Expressing it in terms of the eigenvalues of $H$ , the last term is:

\begin{align*} \frac{1}{2\sigma^2}\Vert \theta^* \Vert_{I + \frac{A^{-1}}{\sigma^2}}^2 &= \frac{1}{2\sigma^2} \sum_{i=1}^r \theta^{*2}_i \underbrace{\left(1 + \frac{1}{1 + \frac{\sigma^2 n \lambda_i}{2\epsilon_L} + \frac{\sigma^2}{\epsilon_\theta^2}}\right)}_{ \underset{n \rightarrow \infty}{\longrightarrow} 1 } + \sum_{i=r+1}^d \theta^{*2}_i \left(1 + \frac{1}{1 + \frac{\sigma^2}{\epsilon_\theta^2}}\right) \\ &\underset{n \rightarrow \infty}{\longrightarrow} \frac{1}{2\sigma^2} \sum_{i=1}^d \theta^{*2}_i + \sum_{i=r+1}^d \theta^{*2}_i \left(\frac{1}{1 + \frac{\sigma^2}{\epsilon_\theta^2}}\right) \end{align*}

Supposing that $\sigma \gg \epsilon_\theta$ , the last term becomes negligible and we get:

\begin{align*} \frac{1}{2\sigma^2}\Vert \theta^* \Vert_{I + \frac{A^{-1}}{\sigma^2}}^2 &\approx \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2 \end{align*}

Supposing that $H$ is of rank $r$ we get:

\begin{align*} -\log(P(\theta \in \hat{\Theta})) &\approx \frac{d}{2} \log{\sigma^2} + \frac{1}{2} \sum_{i=1}^r \log \underbrace{\left(\frac{n\lambda_i}{2\epsilon_L} + \frac{1}{\sigma^2} + \frac{1}{\epsilon_\theta^2} \right)}_{\underset{n \rightarrow \infty}{\approx} \frac{n\lambda_i}{2\epsilon_L}} + \frac{(d-r)}{2} \log \underbrace{\left(\frac{1}{\sigma^2} + \frac{1}{\epsilon_\theta^2} \right)}_{\approx 1/\epsilon_\theta^2} + \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2 \\ &\approx \frac{d}{2} \log\frac{\sigma^2}{\epsilon_\theta^2} + \frac{r}{2}\log n + \frac{1}{2} \sum_{i=1}^{r} \log \frac{\lambda_i}{2\epsilon_L} - \frac{r}{2} \log \frac{1}{\epsilon_\theta^2} + \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2 \\ &\approx \underbrace{d \log\frac{\sigma}{\epsilon_\theta}}_\text{regularity} + \underbrace{\frac{r}{2}\log \frac{n}{\epsilon_\theta^2}}_\text{rank} + \underbrace{\frac{1}{2} \sum_{i=1}^{r} \log \frac{\lambda_i}{2\epsilon_L}}_\text{curvature} + \underbrace{\frac{1}{2\sigma^2} \Vert \theta^* \Vert^2}_\text{prior} \end{align*}

Note: It's possible to obtain a similar result using Laplace approximation, with simpler maths, but I hope that the way I presented it here is more intuitive. In addition, the Laplace approximation derivation does not include the contribution of $\epsilon_\theta$ to account for the regularity of the parameter space.

\begin{align*} \boxed{ -\log(P(\theta \in \hat{\Theta})) \approx \underbrace{d \log\frac{\sigma}{\epsilon_\theta}}_\text{regularity} + \underbrace{\frac{r}{2}\log \frac{n}{\epsilon_\theta^2}}_\text{rank} + \underbrace{\frac{1}{2} \sum_{i=1}^{r} \log \frac{\lambda_i}{2\epsilon_L}}_\text{curvature} + \underbrace{\frac{1}{2\sigma^2} \Vert \theta^* \Vert^2}_\text{prior} } \end{align*}

This is interesting as usually in deep learning the prior term is just $-\log(p(\theta^*)) = \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2$ , here we have a much richer prior that takes into account the rank of the Hessian, number of parameters, regularity of the parameterization and the curvature of the hessian in addition to the usual prior.

We can first notice that only one term increases with $n$ , thus in the infinite data limit, we can keep only one term: the Occam factor.

\boxed{ -\log(P(\theta \in \hat{\Theta})) \approx \frac{r}{2}\log n }

This result is interesting because the Occam factor is not the usual one: $\frac{d}{2}\log{n}$ . The Occam factor is generally used in the Bayesian Information Criterion (BIC) to select models and to balance fitting and overfitting, penalizing models with a high number of parameters that can fit the data better but are more prone to overfitting. What our derivation suggests is that effective dimensionality (the rank of the Hessian), rather than raw parameter count, plays a central role. This is a candidate explanation of why a heavily overparametrized model can still generalize reasonably well.

Indeed, in deep neural networks, the loss landscape is rarely shaped like a sharp bowl. Instead, we often observe flat regions and directions along which the loss hardly changes. Mathematically, this is manifested as a Hessian with many near-zero eigenvalues (a.k.a. close to low rank matrices). Rather than a single isolated minimum, this suggests the existence of a low-dimensional manifold of equivalent or near-equivalent solutions.

This phenomenon is extensively studied in modern deep learning: flat minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) tend to correlate with better generalization; Hessians in deep nets are empirically found to be highly singular (Sagun et al., 2016); minima are connected by low-loss paths (Draxler et al., 2018; Garipov et al., 2018); and (Chizat & Bach, 2020) showed that for infinite width 2-layers ReLU networks, we have dimension independent generalization bounds.

In practice, however, the dimension of the parameter space can be quite large compared to the number of samples we have. Thus, the following approximation is often valid:

\boxed{ -\log(P(\theta \in \hat{\Theta})) \approx d \log\frac{\sigma}{\epsilon_\theta} + \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2 \smile \frac{1}{2\sigma^2} \Vert \theta^* \Vert^2 }

The first term does not depend on the choice of $\theta$ , thus we are left with an L2 weight penalty. However, the first term is related to the choice of the architecture used and how "regular" it is. Note that here we suppose that the rank of the Hessian $r$ does not increase with the number of parameters. This is actually supported by (Sagun et al., 2016) as discussed above.

Side Note on Curvature

It has been argued (Dinh et al., 2017) that the flat minima principle does not explain generalization, as it is not invariant to reparameterization. Different reparametrizations give different curvatures. One possible interpretation is that the choice of the parameterization is part of the modeling process and can be seen as a form of prior. Thus, we should not allow ourselves to change the parameterization in a post-hoc manner. In practice, works like (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) show that flat minimas are linked to better generalization.

Conclusion

Viewing deep learning through a Bayesian lens clarifies many ideas that are often introduced heuristically. Training corresponds to searching for parameters with high posterior probability, regularization reflects prior assumptions, and generalization emerges from a balance between data fit and model complexity.

From this perspective, what matters is not only how well a single parameter configuration fits the data, but how much posterior mass surrounds it. Regions of parameter space where many nearby models perform similarly well contribute more to the evidence than isolated sharp minima.

This leads to an important insight: the effective complexity of a model is not determined by the total number of parameters, but by the number of directions in parameter space that the data actually constrain. In our derivation, this appears through the rank of the Hessian rather than the raw parameter count. When many directions remain flat, large networks can still occupy broad high-probability regions and thus generalize well.

Deep learning is not fully Bayesian in practice, since exact posterior inference is infeasible. However, the Bayesian viewpoint provides a useful conceptual framework: it explains why flat minima matter, why overparameterized models can generalize, and how architectural and optimization choices implicitly shape the space of plausible models.

References

Chizat, L., & Bach, F. (2020). Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. Conference on Learning Theory, 1305–1338.

Dinh, L., Pascanu, R., Bengio, S., & Bengio, Y. (2017). Sharp Minima Can Generalize For Deep Nets. In D. Precup & Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 1019–1028). PMLR. https://proceedings.mlr.press/v70/dinh17b.html

Draxler, F., Veschgini, K., Salmhofer, M., & Hamprecht, F. (2018). Essentially no barriers in neural network energy landscape. International Conference on Machine Learning, 1309–1318.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., & Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in Neural Information Processing Systems, 31.

Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computation.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. International Conference on Learning Representations.

MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Copyright Cambridge University Press.

Sagun, L., Bottou, L., & LeCun, Y. (2016). Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. arXiv Preprint arXiv:1611.07476.

Citing this blog post

@misc{plumerault2026bayesian,
  author = {Plumerault Antoine},
  title = {Beyond Loss Minimization: A Bayesian View of Deep Learning},
  year = {2026},
}

Beyond Loss Minimization: A Bayesian View of Deep Learning