Close view of comet 67P/Churyumov-Gerasimenko by Rosetta (credit: ESA)

©ESA/Rosetta/NAVCAM

A Bayesian Derivation of the Kalman Filter

1 February 2026

If you stumbled on this article, you may be familiar with Kalman filtering. But you may not know how to derive its algorithm from first principles. In this article, I propose a hopefully intuitive derivation from a Bayesian perspective. I assume the reader is comfortable with probability and linear algebra, but prior knowledge of the Kalman filter is not necessary.

A motivating example: asteroid trajectory

Imagine that several observatories just detected a new asteroid. Each telescope produces angular measurements of its position relative to distant stars, but the observations are imprecise: atmospheric distortion, limited resolution, and photon noise introduce non-negligible uncertainty. Early on, the object's distance is poorly constrained, and many orbital paths are compatible with the data.

From basic celestial mechanics, we know how the asteroid should move: its state evolves according to Newtonian gravity. However, the prediction is imperfect. Small non-gravitational perturbations (e.g., the Yarkovsky effect) introduce unknown accelerations, creating uncertainty about the future trajectory.

As a result, the asteroid does not follow a single predicted path, but rather a cloud of possible trajectories, each consistent with what we know so far.

But because of the well-known equation ☄️ + 🌍 = 💥, it is necessary to estimate the shape of this cloud as precisely as possible for an asteroid that could potentially cross Earth's trajectory.

With each new measurement, we gather new information. Our task is to continually update our knowledge of the asteroid's state by combining:

  • Physical predictions, which propagate our current belief forward in time, and
  • Noisy observations, which provide new evidence about where the asteroid actually is.

Formalizing the problem

To address this class of problems in full generality, we abstract away the specific physics and speak only in terms of a hidden state evolving in time under uncertainty. We represent the unknown state by a vector $\mathbf{s} \in \mathbb{R}^n$ and encode our knowledge about it using a probability distribution:

$$f_S(\mathbf{s}) = \mathcal{N}(\mathbf{s}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

We also need to model our knowledge of the system dynamics and our uncertainty about it. Again, we use a normal distribution:

$$f_{S_+|S}(\mathbf{s}_+|\mathbf{s}) = \mathcal{N}\left(\mathbf{s}_+; \mathbf{F}\mathbf{s}, \boldsymbol{\Sigma}_{S_+|S}\right)$$

With:

  • $\mathbf{F} \in \mathcal{M}_{n\times n}(\mathbb{R})$ is the state transition matrix ($\mathbf{F}$ as in Forward). It encodes our "knowledge" of the system dynamics.
  • $\boldsymbol{\Sigma}_{S_+|S} \in \mathcal{M}_{n\times n}(\mathbb{R})$ is the covariance of the process noise. It encodes our "uncertainty" about the system's dynamics.

The last thing we need is an observation model of the world. Observations can be imperfect, giving us only partial information. We model an observation as a vector $\mathbf{o} \in \mathbb{R}^m$ that follows a normal distribution:

$$f_{O|S}(\mathbf{o}|\mathbf{s}) = \mathcal{N}\left(\mathbf{o}; \mathbf{M}\mathbf{s}, \boldsymbol{\Sigma}_{O|S}\right)$$

With:

  • $\mathbf{M} \in \mathcal{M}_{m\times n}(\mathbb{R})$ is the measurement matrix (e.g. the matrix $\mathbf{M} = [1, 0, 0, 0]$ that extracts the first component of the state $\mathbf{s}$).
  • $\boldsymbol{\Sigma}_{O|S} \in \mathcal{M}_{m\times m}(\mathbb{R})$ is the covariance of the observation noise (this noise allows us to model the uncertainty of the value measured through $\mathbf{M}$).

We have now abstracted the situation mathematically.

We now want to update our belief about the state using both what we know about the dynamics and the information we gather from observations. Our objective is to update our current belief $f_{S_k}$ into $f_{S_{k+1}}$.

We need to consider two cases:

  • We update our belief because the system has evolved, using our knowledge of the system dynamics.
  • We update our belief using the information gathered by observation.

Case N°1: dynamics (temporal update - prediction)

This case is straightforward. We have

$$f_S(\mathbf{s}) = \mathcal{N}(\mathbf{s}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

And

$$f_{S_+|S}(\mathbf{s}_+|\mathbf{s}) = \mathcal{N}\left(\mathbf{s}_+; \mathbf{F}\mathbf{s}, \boldsymbol{\Sigma}_{S_+|S}\right)$$

We can interpret this formula as

$$S_+ = \mathbf{F}S + N$$

With a noise $N \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{S_+|S})$ independent of $S$. Written like this, the result is immediate: an affine transformation of a Gaussian is Gaussian, and the covariances of independent terms add. Hence:

$$f_{S_+}(\mathbf{s}_+) = \mathcal{N}(\mathbf{s}_+; \boldsymbol{\mu}_+, \boldsymbol{\Sigma}_+)$$

With

{μ+=FμΣ+=FΣF⊀+ΣS+∣S\left\lbrace \begin{align*} \v{\mu}_+ &= \v{F}\v{\mu} \\ \v{\Sigma}_+ &= \v{F}\v{\Sigma}\v{F}^\top + \v{\Sigma}_{S_+|S} \end{align*} \right.

Case N°2: observation (measurement update - correction)

This case is more involved.

We want to compute $f_{S|O}(\mathbf{s}|\mathbf{o})$. Here we can use Bayes' formula:

$$f_{S|O}(\mathbf{s}|\mathbf{o}) = \frac{f_{O|S}(\mathbf{o}|\mathbf{s})\, f_S(\mathbf{s})}{f_O(\mathbf{o})}$$

With:

  • $f_S(\mathbf{s}) = \mathcal{N}(\mathbf{s}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is our belief prior to the observation.
  • $f_{O|S}(\mathbf{o}|\mathbf{s}) = \mathcal{N}(\mathbf{o}; \mathbf{M}\mathbf{s}, \boldsymbol{\Sigma}_{O|S})$ is the likelihood of the observation given $\mathbf{s}$.
  • $f_O(\mathbf{o})$ has no dependence on $\mathbf{s}$; we will treat it as the simple normalization factor it actually is.

As a function of $\mathbf{s}$, $f_{S|O}(\mathbf{s}|\mathbf{o})$ is proportional to the product of two Gaussian densities, so it is itself Gaussian (this is easy to prove using the explicit expression of the Gaussian density).

To find its parameters, we identify the quadratic form in the exponent:

$$(\mathbf{s} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{s} - \boldsymbol{\mu}) + (\mathbf{o} - \mathbf{M}\mathbf{s})^\top \boldsymbol{\Sigma}_{O|S}^{-1} (\mathbf{o} - \mathbf{M}\mathbf{s})$$

With

$$(\mathbf{s} - \boldsymbol{\mu}_+)^\top \boldsymbol{\Sigma}_+^{-1} (\mathbf{s} - \boldsymbol{\mu}_+) + \mathrm{const}$$

We can group the quadratic terms in $\mathbf{s}$:

$$\mathbf{s}^\top \boldsymbol{\Sigma}_+^{-1} \mathbf{s} = \mathbf{s}^\top \left(\boldsymbol{\Sigma}^{-1} + \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{M}\right) \mathbf{s}$$

Thus, the new covariance is:

$$\boldsymbol{\Sigma}_+ = \left(\boldsymbol{\Sigma}^{-1} + \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{M}\right)^{-1}$$
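
As a quick sanity check of this identification, one can expand the 1D case symbolically. A throwaway sketch using sympy, where sigma2 and r2 stand for the scalar variances $\boldsymbol{\Sigma}$ and $\boldsymbol{\Sigma}_{O|S}$:

```python
import sympy as sp

s, mu, o, m = sp.symbols('s mu o m', real=True)
sigma2, r2 = sp.symbols('sigma2 r2', positive=True)  # 1D Sigma and Sigma_{O|S}

# The exponent to identify, in the 1D case:
expr = sp.expand((s - mu)**2 / sigma2 + (o - m*s)**2 / r2)

# Its s^2 coefficient must equal Sigma_+^{-1} = sigma2^{-1} + m^2 r2^{-1}.
assert sp.simplify(expr.coeff(s, 2) - (1/sigma2 + m**2/r2)) == 0
```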

We can do the same with the terms that are linear in $\mathbf{s}$:

$$\begin{aligned} -\mathbf{s}^\top \boldsymbol{\Sigma}_+^{-1} \boldsymbol{\mu}_+ &= -\mathbf{s}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \mathbf{s}^\top \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{o} \\ \boldsymbol{\Sigma}_+^{-1} \boldsymbol{\mu}_+ &= \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{o} \\ \boldsymbol{\mu}_+ &= \boldsymbol{\Sigma}_+ \left(\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{o}\right) \end{aligned}$$
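
Translated directly into code, this "information form" of the measurement update reads as follows. It is only a sketch: np.linalg.inv is used to mirror the formulas, although solving linear systems would be preferable numerically.

```python
import numpy as np

def update_information_form(mu, Sigma, o, M, Sigma_obs):
    """Measurement update, information form.

    Sigma_+ = (Sigma^-1 + M^T Sigma_{O|S}^-1 M)^-1
    mu_+    = Sigma_+ (Sigma^-1 mu + M^T Sigma_{O|S}^-1 o)
    """
    Sigma_inv = np.linalg.inv(Sigma)
    R_inv = np.linalg.inv(Sigma_obs)  # inverse of the observation-noise covariance
    Sigma_plus = np.linalg.inv(Sigma_inv + M.T @ R_inv @ M)
    mu_plus = Sigma_plus @ (Sigma_inv @ mu + M.T @ R_inv @ o)
    return mu_plus, Sigma_plus
```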

This figure summarizes the belief update process:

If you are familiar with the Kalman filter, you may not recognise the usual form. Indeed, we can do better: evaluating $\boldsymbol{\Sigma}_+$ as written requires inverting an $n \times n$ matrix, but the matrix $\mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{M}$ has rank at most $m$, so $\boldsymbol{\Sigma}_+$ differs from $\boldsymbol{\Sigma}$ only on a subspace of dimension $m$. Perhaps we can reduce the computational complexity of the inversion from $\mathcal{O}(n^3)$ to $\mathcal{O}(m^3)$ by finding a rank-$m$ correction to the matrix $\boldsymbol{\Sigma}$.

The Woodbury matrix identity can help us here:

$$(\mathbf{A} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{U}\left(\mathbf{V}\mathbf{A}^{-1}\mathbf{U} + \mathbf{C}^{-1}\right)^{-1}\mathbf{V}\mathbf{A}^{-1}$$
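
If you want to convince yourself numerically before applying it, here is a quick throwaway check of the identity on random matrices (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # invertible n x n matrix
C = np.eye(m)                                      # invertible m x m matrix
U = rng.standard_normal((n, m))
V = rng.standard_normal((m, n))

lhs = np.linalg.inv(A + U @ C @ V)
A_inv = np.linalg.inv(A)
rhs = A_inv - A_inv @ U @ np.linalg.inv(V @ A_inv @ U + np.linalg.inv(C)) @ V @ A_inv
assert np.allclose(lhs, rhs)  # both sides agree up to floating-point error
```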

Applying it with $\mathbf{A} = \boldsymbol{\Sigma}^{-1}$, $\mathbf{U} = \mathbf{M}^\top$, $\mathbf{C} = \boldsymbol{\Sigma}_{O|S}^{-1}$, and $\mathbf{V} = \mathbf{M}$:

$$\begin{aligned} \boldsymbol{\Sigma}_+ &= \left(\boldsymbol{\Sigma}^{-1} + \mathbf{M}^\top \boldsymbol{\Sigma}_{O|S}^{-1} \mathbf{M}\right)^{-1} \\ &= \boldsymbol{\Sigma} - \boldsymbol{\Sigma}\mathbf{M}^\top \left(\mathbf{M}\boldsymbol{\Sigma}\mathbf{M}^\top + \boldsymbol{\Sigma}_{O|S}\right)^{-1} \mathbf{M}\boldsymbol{\Sigma} \end{aligned}$$

Here we introduce two simplifying notations:

  • $\boldsymbol{\Sigma}_O \triangleq \mathbf{M}\boldsymbol{\Sigma}\mathbf{M}^\top + \boldsymbol{\Sigma}_{O|S}$, the covariance of the predicted observation.
  • $\mathbf{K} \triangleq \boldsymbol{\Sigma}\mathbf{M}^\top \boldsymbol{\Sigma}_O^{-1}$, often called the optimal Kalman gain.

With this notation, we have:

$$\begin{aligned} \boldsymbol{\Sigma}_+ &= \boldsymbol{\Sigma} - \underbrace{\boldsymbol{\Sigma}\mathbf{M}^\top \Big(\underbrace{\mathbf{M}\boldsymbol{\Sigma}\mathbf{M}^\top + \boldsymbol{\Sigma}_{O|S}}_{\triangleq \boldsymbol{\Sigma}_O}\Big)^{-1}}_{\triangleq \mathbf{K}} \mathbf{M}\boldsymbol{\Sigma} \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\Sigma} \end{aligned}$$

And

$$\begin{aligned} \boldsymbol{\mu}_+ &= \boldsymbol{\Sigma}_+\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o}\right) \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\Sigma}\left(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o}\right) \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\mu} + (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\mu} + \boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} - \underbrace{\mathbf{K}}_{= \boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_O^{-1}} (\underbrace{\mathbf{M}\boldsymbol{\Sigma}\mathbf{M}^\top + \boldsymbol{\Sigma}_{O|S}}_{=\boldsymbol{\Sigma}_O} - \boldsymbol{\Sigma}_{O|S})\,\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\mu} + \boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} - \boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_O^{-1}(\boldsymbol{\Sigma}_O - \boldsymbol{\Sigma}_{O|S})\,\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} \\ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\mu} + \underbrace{\boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o} - \boldsymbol{\Sigma}\mathbf{M}^\top\boldsymbol{\Sigma}_{O|S}^{-1}\mathbf{o}}_{=0} + \mathbf{K}\mathbf{o} \\ &= \boldsymbol{\mu} + \mathbf{K}(\mathbf{o} - \mathbf{M}\boldsymbol{\mu}) \end{aligned}$$

We recovered the usual form of the Kalman filter:

$$\begin{aligned} \boldsymbol{\Sigma}_+ &= (\mathbf{I} - \mathbf{K}\mathbf{M})\,\boldsymbol{\Sigma} \\ \boldsymbol{\mu}_+ &= \boldsymbol{\mu} + \mathbf{K}(\mathbf{o} - \mathbf{M}\boldsymbol{\mu}) \end{aligned}$$
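
As a sanity check, here is the same update in its usual "gain" form, which only inverts the $m \times m$ matrix $\boldsymbol{\Sigma}_O$. This sketch reuses the illustrative NumPy setup and the update_information_form function defined earlier; both forms should agree up to numerical error.

```python
def update_gain_form(mu, Sigma, o, M, Sigma_obs):
    """Measurement update in the usual Kalman form.

    Sigma_O = M Sigma M^T + Sigma_{O|S}  (m x m, cheap to invert)
    K       = Sigma M^T Sigma_O^-1
    mu_+    = mu + K (o - M mu)
    Sigma_+ = (I - K M) Sigma
    """
    Sigma_O = M @ Sigma @ M.T + Sigma_obs
    K = Sigma @ M.T @ np.linalg.inv(Sigma_O)
    mu_plus = mu + K @ (o - M @ mu)
    Sigma_plus = (np.eye(len(mu)) - K @ M) @ Sigma
    return mu_plus, Sigma_plus

# The two forms of the measurement update coincide:
o = np.array([0.8])  # an illustrative observation
mu1, S1 = update_information_form(mu, Sigma, o, M, Sigma_obs)
mu2, S2 = update_gain_form(mu, Sigma, o, M, Sigma_obs)
assert np.allclose(mu1, mu2) and np.allclose(S1, S2)
```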

Conclusion

We showed that the Kalman filter can be recovered using basic probability tools and the Woodbury matrix identity. The main assumption we made is that the state belief, the observation noise, and the evolution uncertainty are all Gaussian. These assumptions make the computations simpler, but the same basic principles can be applied without them to handle cases where richer modelling is needed. However, the resulting algorithms are often more computationally costly.