The video below covers Markov chain Monte Carlo methods — known by the acronym MCMC — topics that are central to understanding many current developments in machine learning and filtering, and that are, in my view, not well taught or well understood at the moment. The goal of this article is to start with the foundations, filling in the concepts that are often skipped, and then to supplement the video with insights on adaptations and applications.
Foundations
Sample mean versus expected value
In the real world, what you have is a sample mean over $N$ measurements. It is noisy, data-driven, and reflects only what has already happened. Flip a coin ten times and get six heads: your mean is $0.6$. That is the empirical world.
The expected value, written $E(x)$, belongs to the model world. It is a theoretical statement. When you say that the expected value of a fair coin is $0.5$, you are defining a rule of the system, not measuring an outcome.
The relationship between the two is what validates models. The gap between the sample mean and the expected value must vanish as $N$ grows toward infinity; this is the law of large numbers. If it does not, your model is wrong or your coin is biased.
The distinction matters in the video because Monte Carlo runs this logic in reverse: when you cannot compute $E(x)$ analytically for a complex system, you generate many random samples and let the sample mean converge to the hidden theoretical expectation.
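This reverse logic fits in a few lines. The snippet below is a minimal illustration using Python's standard `random` module; the seed and sample sizes are arbitrary choices of mine, not from the video.

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

def sample_mean(n):
    """Empirical mean of n fair-coin flips (1 = heads, 0 = tails)."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The sample mean drifts toward the model-world E(x) = 0.5 as N grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```

With ten flips the gap can be large; with a hundred thousand it is typically a fraction of a percent.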
Independent events
Independence is intuitively clear but hard to internalize in practice. Our cognitive bias insists that after many heads, tails must be due. They are not. Each flip is independent regardless of history, and yet if you keep flipping, the mean converges to $0.5$. The two facts coexist and both are true.
The mathematical statement most often used is clean: two zero-mean variables $x$ and $y$ are uncorrelated when the expected value of their product is zero, $E[xy] = 0$. Independence is the stronger condition: it implies $E[xy] = 0$, but the converse does not hold in general.
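A quick numerical check of that product rule, as a sketch using Python's standard `random` module (the seed and sequence length are illustrative):

```python
import random

random.seed(1)  # arbitrary seed

N = 100_000
# Two independent zero-mean variables: fair coin flips mapped to +/-1.
x = [random.choice((-1, 1)) for _ in range(N)]
y = [random.choice((-1, 1)) for _ in range(N)]

exy = sum(a * b for a, b in zip(x, y)) / N  # sample estimate of E[xy]
print(exy)  # hovers near 0, shrinking as N grows
```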
Probability density functions
A PDF models a random event by defining how probability distributes across the domain of the variable. From $p(x)$ you can extract its moments. The expected value is the first moment: the distribution's center of mass. Variance is the second central moment, giving the spread around that center.
One constraint is universal: the integral of $p(x)$ over its full domain must equal $1$. The total probability of something happening is certain: $\int_X p(x)\,dx = 1$.
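The normalization constraint is easy to verify numerically. A minimal sketch for a standard Gaussian, using a plain Riemann sum; the grid spacing and integration limits are arbitrary choices:

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann sum over [-8, 8]: the total probability must come out to 1.
dx = 0.001
total = sum(gauss_pdf(-8.0 + i * dx) * dx for i in range(int(16.0 / dx)))
print(total)
```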
A few canonical shapes each capture a class of natural phenomena:
- Uniform. Every outcome equally likely: the honest die.
- Gaussian (Normal). The bell curve. Measurement error, population distributions, thermal noise in electronics.
- Exponential. Time between events: how long until a component fails.
- Poisson. Count of events in a fixed interval: server requests per second.
All of these are simple parametric models: the Gaussian is fully defined by two parameters, a center and a width, while the exponential and Poisson need only a single rate. Descriptions built on mean and variance are sufficient for tools like the Kalman filter. But other methods allow $p(x)$ to take arbitrary shapes, and that is where the central problem appears: given a complicated $p(x)$, how do you generate samples that follow it? That is what MCMC solves.
Central Limit Theorem
The CLT governs the collective behavior of random variables. Regardless of what $p(x)$ looks like for a single observation, if you sum many independent samples from it, the distribution of that sum, properly centered and scaled, converges to a Gaussian. The original shape does not matter.
This is what makes Monte Carlo reliable. Once you have collected enough samples, your estimates behave predictably, with calculable error margins and confidence intervals, even when the underlying system is chaotic. 3Blue1Brown has the clearest visual treatment of this.
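The convergence is easy to see numerically. The sketch below sums uniform samples, whose individual shape is flat, and checks that the sums land on the Gaussian predicted by theory; the counts are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(2)  # arbitrary seed

# Sum 30 uniform(0, 1) samples; each sum is one draw from a near-Gaussian.
sums = [sum(random.random() for _ in range(30)) for _ in range(20_000)]

mean = statistics.fmean(sums)   # theory: 30 * 0.5 = 15
stdev = statistics.stdev(sums)  # theory: sqrt(30 * 1/12) ≈ 1.58
print(mean, stdev)
```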
Stochastic processes
A deterministic signal assigns an exact value to $x(t)$ at every $t$. If you know the function, you know the future.
A stochastic process replaces that certainty with a probability law that evolves over time. Instead of a single trajectory, you have a bundle of possible ones. The process does not tell you where the system will be at the next step; it defines which states are more or less likely.
The simplest case is white noise: each observation is independent of the last, centered on a mean, with no temporal correlation. It is a theoretical construct, but the right baseline for anything that falls under the CLT. Gaussian white noise is distributed as $x \sim \mathcal{N}(\mu, \sigma^2)$.
A random walk integrates white noise: the path of someone moving with a completely random velocity at each step. Where they end up depends on where they are now, but each step is pure noise. This is the simplest Markov chain.
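A random walk is one line of accumulation. The sketch below (standard `random` module; walk length, seed, and counts are arbitrary) also checks the walk's defining signature: the endpoint's variance grows linearly with the number of steps.

```python
import random
import statistics

random.seed(3)  # arbitrary seed

def random_walk(n):
    """Integrate Gaussian white noise: x[k+1] = x[k] + w[k]."""
    x = 0.0
    for _ in range(n):
        x += random.gauss(0.0, 1.0)  # innovation, independent of all history
    return x

# Endpoint variance after n unit-variance steps is n.
finals = [random_walk(100) for _ in range(2_000)]
var = statistics.pvariance(finals)  # close to 100
print(var)
```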
Power spectral density
The PSD describes how signal power distributes across frequency. For a stationary stochastic process $x(t)$, $S(f)$ is the Fourier transform of the autocorrelation function (the Wiener–Khinchin theorem), which measures how correlated two observations are as the time lag between them grows.
When you pass the process through a linear system $H$ (representing the Markov chain's impulse response), the output spectral density is: $S_{\text{out}}(f) = |H(f)|^2 S_{\text{in}}(f)$.
Filtering white noise ($S_{\text{in}}(f) = 1$) produces colored noise whose frequency profile is the squared magnitude of $H$.
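The coloring effect can be demonstrated without a full FFT. The sketch below runs white noise through a first-order low-pass $H$ (the pole at $a = 0.9$ and the sequence length are arbitrary choices) and compares lag-1 autocorrelation before and after: white in, colored out.

```python
import random

random.seed(4)  # arbitrary seed

N = 50_000
w = [random.gauss(0.0, 1.0) for _ in range(N)]  # white input, flat spectrum

# First-order low-pass H: y[k] = a*y[k-1] + (1 - a)*w[k]
a = 0.9
y = [0.0]
for k in range(1, N):
    y.append(a * y[-1] + (1 - a) * w[k])

def lag1_corr(s):
    """Sample autocorrelation at lag 1."""
    m = sum(s) / len(s)
    num = sum((s[k] - m) * (s[k + 1] - m) for k in range(len(s) - 1))
    den = sum((v - m) ** 2 for v in s)
    return num / den

print(lag1_corr(w))  # near 0: uncorrelated samples
print(lag1_corr(y))  # near a = 0.9: correlated, low-frequency heavy
```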
Development
Sampling and Markov chain dynamics
Sampling probability density functions presents real geometric challenges, especially when dealing with multimodal functions — functions where $p(x)$ has more than one peak. Those peaks may be isolated points or entire curves of equal probability, regions the video calls the Typical Set.
A targeting example makes this concrete. You assume that a whole region around the center shares equal hit probability. In 2D, the most likely impact points form a ring around the center; in 1D, that ring collapses to two hills, one on each side of the center.
Pulling the theme toward the casino — since, after all, we are talking about Monte Carlo — another intuitive way to understand this is through a card game. If you make a bad move that scores low, there is no reason to repeat it; you look for an alternative to win. On the other hand, if the move was good, you tend to stick with it. You end up orbiting the choices that produce results.
All of these cases share a characteristic: when the shape of the distribution becomes complicated, the volume mismatch becomes an obstacle. The real interest is finding the typical regions where the data accumulates. The naive approach, grid sampling, forces you to sweep an entire cube of side $2R$. Most of the probability concentrates in a thin region that threads through the space while the rest is nearly empty, and the mismatch is enormous. Worse, grid sampling discards information it already has: after the first sample, you have a data point, and the algorithm ignores it entirely.
If you start searching from a corner of the region, you will mostly find values near zero. There is no reason to keep spending time there. The relevant data must be elsewhere, but how far? That is where Markov chains come in.
A Markov chain has one rule: in a discrete system, the state at step $k+1$ depends only on the state at step $k$ and an innovation, a noise term independent of the state and of all past history. It is a first-order stochastic process, and the random walk from earlier is its simplest instance.
The MCMC strategy uses the current value of $p(x)$ to guide the next step. A move that lands in a higher-probability region gets accepted; a move toward low-probability space gets rejected or accepted with lower frequency. The sampler ends up orbiting the regions that matter without sweeping the empty ones.
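This accept/reject rule is the core of the Metropolis algorithm, the canonical MCMC sampler. Below is a minimal sketch for a bimodal target; the bump locations, proposal width, and chain length are all illustrative choices of mine, not from the video.

```python
import random
from math import exp

random.seed(5)  # arbitrary seed

def target(x):
    """Unnormalized bimodal p(x): two unit-width bumps at -3 and +3."""
    return exp(-0.5 * (x - 3.0) ** 2) + exp(-0.5 * (x + 3.0) ** 2)

def metropolis(n, step=2.0):
    """Random-walk Metropolis: uphill moves in p(x) are always accepted;
    downhill moves are accepted with probability p(x')/p(x)."""
    x, samples = 0.0, []
    for _ in range(n):
        proposal = x + random.gauss(0.0, step)
        if random.random() < target(proposal) / target(x):
            x = proposal  # accepted: the chain moves
        samples.append(x)  # a rejected move repeats the current state
    return samples

samples = metropolis(50_000)
```

The chain spends almost all its time orbiting the two bumps and only occasionally crosses the low-probability valley between them, which is exactly the orbiting behavior described above.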
Hardware and bandwidth constraints
In engineering practice, sampling is usually implemented in hardware as a band-limited impulse function, clocked by the processor. The video touches on this when discussing sample rejection: when the sampling bandwidth is too narrow, the algorithm can converge to a false maximum or enter an unstable cycle.
This is underestimated in academic implementations but matters a lot in embedded and real-time applications. The Nyquist constraint does not disappear because you are sampling a probability distribution. White noise remains theoretical for exactly this reason: true white noise would require infinite bandwidth. In practice, every real sampler introduces an $H(s)$ that colors the noise.
Conclusion
The move from exhaustive sampling to MCMC is a response to the curse of dimensionality. Rejection and importance sampling methods fail exponentially as dimension grows, because the volume mismatch between the proposal region and the target grows with it. MCMC navigates the typical set directly and scales far more gracefully with dimension.
It is, as far as I know, the only practical solution available for models without a closed-form distribution or with computationally prohibitive evidence, such as high-order finite mixtures. The band-limited hardware implementation imposes a filter $H(s)$ that colors the sample noise. Nyquist still applies. White noise and the Dirac-delta autocorrelation that models it remain purely theoretical constructs.