If you stumbled on this article, you may be familiar with Kalman filtering. But you may not know how to derive its algorithm from first principles. In this article, I propose a hopefully intuitive derivation from a Bayesian perspective. I assume the reader is comfortable with probability and linear algebra, but prior knowledge of the Kalman filter is not necessary.
A motivating example: asteroid trajectory
Imagine that several observatories just detected a new asteroid. Each telescope produces angular measurements of its position relative to distant stars, but the observations are imprecise: atmospheric distortion, limited resolution, and photon noise introduce non-negligible uncertainty. Early on, the object's distance is poorly constrained, and many orbital paths are compatible with the data.
From basic celestial mechanics, we know how the asteroid should move: its state evolves according to Newtonian gravity. However, the prediction is imperfect. Small non-gravitational perturbations (e.g., Yarkovsky effect) introduce unknown accelerations, creating uncertainties about the future trajectory.
As a result, the asteroid does not follow a single predicted path, but rather a cloud of possible trajectories, each consistent with what we know so far.
But because of the well-known equation ☄️ + 🌍 = 💥, it is necessary to estimate the shape of this cloud as precisely as possible for an asteroid that could potentially cross Earth's trajectory.
With each new measurement, we gather new information. Our task is to continually update our knowledge of the asteroid's state by combining:
Physical predictions, which propagate our current belief forward in time, and
Noisy observations, which provide new evidence about where the asteroid actually is.
Formalizing the problem
To address this class of problems in full generality, we abstract away the specific physics and speak only in terms of a hidden state evolving in time under uncertainty. We represent the unknown state by a vector
$s \in \mathbb{R}^n$ and encode our knowledge about it using a probability distribution:

$$f_S(s) = \mathcal{N}(s; \mu, \Sigma)$$
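Concretely, a Gaussian belief is just a mean vector $\mu$ (our best estimate) and a covariance matrix $\Sigma$ (our uncertainty). A minimal numpy sketch, with illustrative dimensions and values:

```python
import numpy as np

# A Gaussian belief over a 4-dimensional state (e.g. 2D position + 2D velocity).
mu = np.zeros(4)                       # initial estimate: at the origin, at rest
Sigma = np.diag([1e6, 1e6, 1e3, 1e3])  # large variances: we know very little yet

def gaussian_pdf(s, mu, Sigma):
    """Evaluate the density f_S(s) = N(s; mu, Sigma) at a state s."""
    n = len(mu)
    d = s - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm
```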
We also need to model our belief/knowledge/uncertainty about the system dynamics. Again, we use a normal distribution:

$$f_{S+1 \mid S}(s' \mid s) = \mathcal{N}(s'; Fs, \Sigma_{S+1 \mid S})$$

With:

$F \in \mathcal{M}_{n \times n}(\mathbb{R})$ the state transition matrix ($F$ as in Forward). It encodes our "knowledge" of the system dynamics.

$\Sigma_{S+1 \mid S} \in \mathcal{M}_{n \times n}(\mathbb{R})$ the covariance of the process noise. It encodes our "uncertainty" about the system's dynamics.
The last thing we need is an observation model of the world. Observations can be imperfect, giving us only partial information about the world. We model an observation by a vector $o \in \mathbb{R}^m$ that follows a normal distribution:

$$f_{O \mid S}(o \mid s) = \mathcal{N}(o; Ms, \Sigma_{O \mid S})$$

With:

$M \in \mathcal{M}_{m \times n}(\mathbb{R})$ the measurement matrix (e.g. the matrix $M = [1, 0, 0, 0]$ that extracts the first component of the state $s$).

$\Sigma_{O \mid S} \in \mathcal{M}_{m \times m}(\mathbb{R})$ the covariance of the observation noise (this noise lets us model the uncertainty of the value measured through $M$).
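To make the abstraction concrete, here is a toy instantiation of the four matrices for a 1D constant-velocity model; the names `Q` and `R` stand in for $\Sigma_{S+1 \mid S}$ and $\Sigma_{O \mid S}$, and all values are illustrative:

```python
import numpy as np

dt = 1.0  # time step (illustrative)

# State s = [position, velocity]: constant-velocity dynamics.
F = np.array([[1.0, dt],
              [0.0, 1.0]])   # state transition matrix
Q = 0.01 * np.eye(2)         # process-noise covariance (Sigma_{S+1|S})

# We only observe the position, with some measurement noise.
M = np.array([[1.0, 0.0]])   # measurement matrix (m = 1, n = 2)
R = np.array([[0.5]])        # observation-noise covariance (Sigma_{O|S})
```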
We have now abstracted the situation mathematically.
We now want to update our belief about the state using what we know about the dynamics and the information we gather from our observations. Our objective is to update our belief about the state $f_{S_k}$ to obtain $f_{S_{k+1}}$.
We need to consider two cases:
We update our belief because the system has evolved by using our knowledge of the system dynamics.
We update our belief using the information gathered by observation.
Case N°1: dynamics (temporal update - prediction)
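The temporal update can be sketched in a few lines: if $s \sim \mathcal{N}(\mu, \Sigma)$ and $s' = Fs + w$ with $w \sim \mathcal{N}(0, \Sigma_{S+1 \mid S})$, then $s' \sim \mathcal{N}(F\mu, F\Sigma F^\top + \Sigma_{S+1 \mid S})$. A sketch with a toy constant-velocity model (the helper name and values are mine, not from the article):

```python
import numpy as np

def predict(mu, Sigma, F, Q):
    """Temporal update: push the Gaussian belief through the dynamics.

    The mean follows the dynamics (F @ mu) and the uncertainty grows
    by the process noise (F @ Sigma @ F.T + Q).
    """
    mu_pred = F @ mu
    Sigma_pred = F @ Sigma @ F.T + Q
    return mu_pred, Sigma_pred

# Toy 1D constant-velocity example (values are illustrative).
F = np.array([[1.0, 1.0], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
mu, Sigma = np.array([0.0, 1.0]), np.eye(2)
mu, Sigma = predict(mu, Sigma, F, Q)  # mean moves forward, covariance grows
```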
If you are familiar with the Kalman filter, you may not recognise the usual form here. Indeed, we can do better: evaluating $\Sigma^+ = \left(\Sigma^{-1} + M^\top \Sigma_{O \mid S}^{-1} M\right)^{-1}$ requires inverting an $n \times n$ matrix. But the matrix $M^\top \Sigma_{O \mid S}^{-1} M$ has rank at most $m$, so $\Sigma^+$ differs from $\Sigma$ only on a subspace of dimension $m$. Perhaps we can reduce the computational complexity of the inversion from $O(n^3)$ to $O(m^3)$ by finding a rank-$m$ linear correction to the matrix $\Sigma$.
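This rank-$m$ correction is exactly what the Woodbury identity delivers, yielding the familiar Kalman-gain form. A sketch comparing the two mathematically equivalent updates (function names and toy numbers are mine):

```python
import numpy as np

def update_information_form(mu, Sigma, o, M, R):
    # Posterior via an n x n inversion: O(n^3).
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + M.T @ np.linalg.inv(R) @ M)
    mu_post = Sigma_post @ (np.linalg.inv(Sigma) @ mu + M.T @ np.linalg.inv(R) @ o)
    return mu_post, Sigma_post

def update_kalman_gain(mu, Sigma, o, M, R):
    # Woodbury rearrangement: the only inversion is m x m, O(m^3).
    S = M @ Sigma @ M.T + R              # innovation covariance (m x m)
    K = Sigma @ M.T @ np.linalg.inv(S)   # Kalman gain
    mu_post = mu + K @ (o - M @ mu)
    Sigma_post = Sigma - K @ M @ Sigma   # rank-m correction to Sigma
    return mu_post, Sigma_post

# The two forms agree on toy numbers (illustrative):
rng = np.random.default_rng(0)
n, m = 4, 1
Sigma = np.eye(n)
M = rng.standard_normal((m, n))
R = 0.5 * np.eye(m)
mu, o = np.zeros(n), np.ones(m)
a = update_information_form(mu, Sigma, o, M, R)
b = update_kalman_gain(mu, Sigma, o, M, R)
```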
We showed that the Kalman filter can be recovered using basic probability tools and the Woodbury matrix identity. The main assumption we made is that the state belief, observation noise, and evolution uncertainty are Gaussian. These assumptions make the computations simpler, but the same basic principles can be applied without them, to handle cases where richer modeling is needed. However, these alternative algorithms are often more computationally costly.