Markov Chain Monte What?

In this post we will delve into the main idea behind Markov Chain Monte Carlo (MCMC for short) and why it is useful within the bayesian inference framework.


April 25, 2024



Alright, folks, let’s dive into the wild world of statistics and data science! Picture this: you’re knee-deep in data, trying to make sense of the chaos. But here’s the kicker, sometimes the chaos is just too darn complex. With tons of variables flying around, getting a grip on uncertainty can feel like trying to catch smoke with your bare hands.

Please, have in your consideration that the kind of problems that we’re dealing with, it’s not solely related to the number of dimensions, it’s mostly related to trying to estimate something that we can’t see in full beforehand. For instance, consider the following banana distribution (shown below). How could we map this simple two dimensional surface without computing it all at once?

dbanana <- function(x) {
  a = 2;
  b = 0.2;
  y = x / a
  y = (a * b) * (x^2 + a^2)

x <- seq(-6, 6, length.out = 300)

y = dbanana(x)

z <- MASS::kde2d(x, y, n = 100, lims = c(-10, 10, -2.6, 20))

plot_ly(x = z$x, y = z$y, z = sqrt(z$z)) |> 
  add_surface() |> 
  style(hoverinfo = "none")

Just put a darn grid to it

You know when you hit a roadblock in your calculations, and you’re like, “Can’t we just crunch the numbers for every single value?” Well, let’s break it down. Picture a grid with \(N\) points for \(D\) dimensions. Now, brace yourself, ’cause the math needed is like \(N\) raised to the power of \(D\).

So, let’s say you wanna estimate 100 points (to get a decent estimation of the shape) for each of 100 dimensions. That’s like slamming your head against ten to the power of 200 computations… that’s a hell of a lot of computations!

Sure, in la-la land, you could approximate every single number with some degree of approximation. But let’s get real here, even if you had all the time in the world, you’d still be chipping away at those calculations until the sun swallowed the Earth, especially with continuous cases and tons of dimensions that are somewhat correlated (which in reality, tends to be the case).

This headache we’re dealing with? It’s what we “affectionately” call — emphasis on double quotes — the curse of dimensionality. It’s like trying to squeeze a square peg into a round hole… it ain’t gonna happen without a supersized hammer!

curse_dimensionality <- data.frame(dimensions = factor((1:10)^2),
                                   calculations = 100^((1:10)^2))

ggplot(curse_dimensionality, aes(dimensions, calculations)) +
  geom_col(fill = ggsci::pal_jama()(1)) +
  scale_y_continuous(transform = "log10", n.breaks = 9,
                     labels = scales::label_log(), expand = c(0,0,.1,0)) +
  labs(y = "Computations (log-scale)", x = "Dimensions (Variables)",
       title = "Computations needed to compute a grid of 100 points",
       subtitle = "As a function of dimensions/variables involved") +
  theme_classic(base_size = 20)

Illustration of computations needed (in log-scale) for 100 points as a function of dimensions considered.

Imagine you’re trying to create a grid to map out the probability space for a set of variables. As the number of dimensions increases, the number of grid points needed to adequately represent the space explodes exponentially. This means that even with the most powerful computers, it becomes practically impossible to compute all the probabilities accurately.

Sampling the unknown: Markov Chain Monte Carlo

Now, if we can’t crack the problem analytically (which, let’s face it, is the case most of the time), we gotta get creative. Lucky for us, there’s a bunch of algorithms that can lend a hand by sampling this high-dimensional parameter space. Enter the Markov Chain Monte Carlo (MCMC) family of algorithms.

But hold up — Markov Chain Monte What? Yeah, it’s a mouthful, but bear with me. You’re probably wondering how this fancy-schmancy term is connected to exploring high-dimensional probability spaces. Well, I’ll let you in on the secret sauce behind these concepts and why they’re the go-to tools in top-notch probabilistic software like Stan.

But before we get into the nitty-gritty of MCMC, let’s take a detour and talk about Markov Chains, because they’re like the OGs of this whole MCMC gang.

Understanding Markov Chains: A rainy example

Consider the following scenario: if today is rainy, the probability that tomorrow will be rainy again is 60%, but if today is sunny, the probability that tomorrow will be rainy is only 30%. However, the probability of tomorrow being sunny is 40% if today is raining, but 70% if today is sunny as well.

flowchart LR
  a((Rainy)) --->| 40% | b((Sunny))
  a -->| 60% | a
  b -->| 70% | b
  b --->| 30% | a

As you can see, the probability of a future step depends on the current step. This logic is central to Bayesian inference, as it allows us to talk about the conditional probability of a future value based on a previous one, like sampling across a continuous variable.

Converging to an answer

Now, let’s imagine letting time run. After a year passes, if we observe how the weather behaves, we’ll notice that the relative frequencies of each state tend to converge to a single number.

Now, fast forward a year. If we keep an eye on the weather every day, we’ll notice something interesting: the relative frequencies of rainy and sunny days start to settle into a rhythm. This steady state is what we call a stationary distribution. It’s like the true probability of what the weather’s gonna be like in the long run, taking into account all the different scenarios.

simulate_weather <- function(total_time) {
  weather <- vector("character", total_time) # Create slots for each day
  day <- 1 # First day
  weather[day] <- sample(c("Rainy", "Sunny"), size = 1) # Weather for first day
  while (day < total_time) {
    day <- day + 1 # Add one more day
    if (weather[day] == "Rainy") {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.6, .4))
    } else {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.3, .7))

sim_time <- 365*1
weather <- simulate_weather(total_time = sim_time)

weather_data <- data.frame(
  prop = c(cumsum(weather == "Rainy") / seq_len(sim_time), cumsum(weather == "Sunny") / seq_len(sim_time)),
  time = c(seq_len(sim_time), seq_len(sim_time)),
  weather = c(rep("Rainy", times = sim_time), rep("Sunny", times = sim_time))

ggplot(weather_data, aes(time, prop, fill = weather)) +
  geom_area() +
  scale_y_continuous(labels = scales::label_percent(), n.breaks = 6,
                     name = "Proportion of each weather", expand = c(0,0)) +
  scale_x_continuous(name = "Days", n.breaks = 10, expand = c(0,0)) +
  scale_fill_brewer(type = "qual", palette = 3) +
  labs(fill = "Weather", title = "Convergence to stationary distribution",
       subtitle = "Based on cumulative proportion of each Sunny or Rainy days") +
  theme_classic(base_size = 20)

Cumulative mean proportion of sunny/rainy days across 365 days. Right pass the 100 days, the proportion of rainy/sunny days tends to display a stable trend when we averaged the previous days. This is known as stationary distribution.

This heuristic allows us to naturally converge to an answer without needing to solve it analytically, which tends to be useful for really complex and high-dimensional problems.

Sure, we could’ve crunched the numbers ourselves to figure out these probabilities. But why bother with all that math when we can let time do its thing and naturally converge to the same answer? Especially when we’re dealing with complex problems that could have given even Einstein himself a headache.

The idea of convergence to a stationary distribution can be likened to taking a random walk through the space of possible outcomes. Over time, the relative frequencies of each outcome stabilize, giving us a reliable estimate of the true probabilities.

As we’ve seen, sometimes it becomes impractical to solve analytically or even approximate the posterior distribution using a grid, given the number of calculations needed to even get a decent approximation of the posterior.

However, we’ve also seen that Markov Chains might offer us a way to compute complex conditional probabilities and, if we let them run long enough, they will eventually converge to the stationary distribution, which could resemble the posterior distribution itself. So, all things considered, when does the Monte Carlo part come in?

The Need for Monte Carlo Methods

Alright, let’s break down the magic of Monte Carlo methods in plain English. Picture this: in the wacky world of random events, being able to sample from a distribution is like having a crystal ball to predict the future — pretty nifty, right?

Now, imagine we’re sampling from a normal probability density, say, with a mean of 80 and a standard deviation of 5. We grab a random sample of 10 folks, calculate their average weight, and repeat this process a thousand times.

In the following figure, we overlay the calculated sample mean, from each simulated sample using a histogram, to the population distribution from which we are sampling. As you can see, this sets an interesting opportunity, using this Monte Carlo simulation, we can get an intuition of how likely is, to our sample of 10 individuals, have a mean outside the range of 75 to 85, it’s not impossible, but it’s unlikely.

mean_weights <- matrix(data = rnorm(10 * 1000, 80, 5), nrow = 10, ncol = 1000) |> 

cols <- ggsci::pal_jama()(2)

ggplot() +
  stat_function(fun = ~dnorm(.x, 80, 5), xlim = c(60, 100), geom = "area", fill = cols[2]) +
  geom_histogram(aes(x = mean_weights, y = after_stat(density)/3.75),
               fill = cols[1], col = cols[2], binwidth = .4) +
  scale_y_continuous(expand = c(0,0), name = "Density", labels = NULL, breaks = NULL) +
  scale_x_continuous(expand = c(0,0), name = "Weight (kg)", n.breaks = 10) +
  geom_curve(data = data.frame(
    x = c(70), xend = c(74.5), y = c(0.061), yend = c(0.05)
    ), aes(x = x, xend = xend, y = y, yend = yend),
    curvature = -.2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
  geom_curve(data = data.frame(
    x = c(90), xend = c(82), y = c(0.061), yend = c(0.05)
    ), aes(x = x, xend = xend, y = y, yend = yend),
    curvature = .2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
  geom_text(aes(x = c(67, 94), y = c(0.0605), label = c("Population\ndistribution", "Means of each\nsimulated sample")), size = 6) +
  theme_classic(base_size = 20)

Sampling distribution of 10 individuals per sample overlayed to the population distribution. Each sample is drawn from a normal distribution of mean of 80 and standard deviation of 5, representing the population distribution of weight.

With each sample, we’re capturing the randomness of the population’s weight distribution. And hey, it’s not just about weight; we can simulate all sorts of wild scenarios, from multi-variable mayhem to linear model lunacy. This is the heart and soul of Monte Carlo methods: taking random shots in the dark to mimic complex processes.

But here’s the kicker: the more samples we take, the clearer the picture becomes. For instance, if we take ten times the amount of samples used, we would get a better intuition about the uncertainty around the expectation for each sample of 10 individuals, which could have important applications in the design of experiments and hypothesis testing.

And that’s where Monte Carlo methods shine. By generating a boatload of samples, we can unravel the mysteries of even the trickiest distributions, no crystal ball required. It’s a game changer for exploring the unknown without needing a PhD in rocket science.

Monte Carlo methods provide a powerful tool for approximating complex distributions by sampling from them. By generating a large number of samples, we can gain insight into the shape and properties of the distribution without needing to explicitly calculate all possible outcomes.

⁠Basics of MCMC

Alright, let’s break down the basics of MCMC. Picture this: you’ve got these two heavyweights in the world of statistics, Markov Chains and Monte Carlo methods.

On one side, you’ve got Markov Chains. These bad boys help us predict the probability of something happening based on what happened before. It’s like saying, “Hey, if it rained yesterday, what’s the chance it’ll rain again today?”

Then, there are Monte Carlo methods. These puppies work by randomly sampling from a distribution to get an idea of what the whole shebang looks like. It’s like throwing a bunch of darts at a dartboard in the dark and hoping you hit the bullseye.

However the question remains, how do they team up to tackle real-world problems?

What is MCMC actually doing?

In essence, MCMC is an algorithm that generates random samples from a proposal distribution. These samples are accepted or rejected based on how much more likely the proposed sample is compared to the previous accepted sample.

In this way, the proposed samples are accepted in the same proportion as the actual probability in the target distribution, accepting more samples that are more likely and fewer samples that are less likely.

The fascinating nature of this heuristic is that it works to approximate complex distributions without needing to know much about the shape of the final distribution.

So, think of it as trekking through this complex landscape, taking random steps (the Monte Carlo part) but guided by the likelihood of each move, given where you currently stand (the Markov Chain part). It’s a meticulous journey, but one that ultimately leads us to a better understanding of these elusive distributions.

For instance, consider that we have a distribution (shown below) that we can’t to compute, because it would take too long to integrate the whole function. This will be our target distribution, from which we can only compute the density of one value at a time.

In practice, we would derive the target distribution from the data and prior information, this enable us to estimate the density in a point-wise manner, without the need to estimate the whole PDF all at once. But for the sake of demonstration we will the use the Gamma probability density function.

However, please consider that you can’t use some family distribution to describe perfectly any probability density, sometimes it can be a mixture of distributions, truncation, censoring. All comes down to the underlying process that generates the data that we are trying to mimic.

# Target distribution that we in practice would derive from
# the data.
target_dist <- function(i) dgamma(i, shape = 2, scale = 1)

ggplot() +
  stat_function(fun = target_dist,
                xlim = c(0, 11), geom = "area", 
                fill = "#374E55FF") +
  scale_y_continuous(breaks = NULL, name = "Density", expand = c(0,0)) +
  scale_x_continuous(name = "Some scale", expand = c(0,0)) +
  theme_classic(base_size = 20)

This is a Gamma distribution with shape of 2 and scale of 1. We will try to estimate it.

Next thing to do is to specify a proposal distribution, from which we’ll generate proposals for the next step. To this end we’ll be using a Normal density function with \(\mu\) = 0 and \(\sigma\) = 1.

# This is a function that will generate proposals for the next step.
proprosal <- function() rnorm(1, mean = 0, sd = 1)

And set some algorithm parameters that are necessary for our MCMC to run:

## Algorithm parameters ----

total_steps <- 1000 # Total number of steps
step <- 1 # We start at step 1
value <- 10 # set a initial starting value

Finally, we run our algorithm as explained in previous sections. Try to follow the code to get an intuition of what is doing.

## Algorithm ----

set.seed(1234) # Seed for reproducibility
while(step < total_steps) {
  # Increase for next step
  step <- step + 1
  ## 1. Propose a new value ----
  # Proposal of the next step is ...
  value[step] <- 
    # the previous step plus...
    value[step - 1L] + 
    # a change in a random direction (based on the 
    # proposal distribution)
  ## 2. We see if the new value is more or less likely ----
  # How likely (in the target distribution)
  likelihood <- 
    # is the proposed value compared to the previous step
    target_dist(value[step]) / target_dist(value[step - 1L]) 
  ## 3. Based on its likelihood, we accept or reject it ----
  # If the proposal value is less likely, we accept it only 
  # to the likelihood of the proposed value
  if (likelihood < runif(1)) 
    value[step] <- value[step - 1L]
  # Then we repeat for the next step

Finally, let’s explore how well our algorithm converge to the target distribution.

mcmc <- data.frame(
  step = seq_len(step),
  value = value

ggplot(mcmc, aes(x = step, y = value)) +
  geom_line(col = "#374E55FF") +
  ggside::geom_ysidehistogram(aes(x = -after_stat(count)), fill = "#374E55FF", binwidth = .3) +
  ggside::geom_ysidedensity(aes(x = -after_stat(count)*.35), col = "#374E55FF") +
  ggside::scale_ysidex_continuous(expand = c(0,0,0,.1), breaks = NULL) +
  scale_x_continuous(expand = c(0,0), name = "Step") +
  scale_y_continuous(name = NULL, position = "right") +
  labs(title = "Trace of MCMC values to target distribution",
       subtitle = "Evolution of values at each step") +
  theme_classic(base_size = 20) +
  ggside::ggside(y.pos = "left") +
  theme(ggside.panel.scale = .4)

Traceplot of convergence of MCMC for 1000 steps. With increasing steps we see an increasing resemblance to the target distribution.

Another thing that we care is to see how well our MCMC is performing. After all, if not, then what would be the point of using it in first place? To check this, we’ll compare the expectation (\(E(X)\)) of the target distribution against the posterior derived from our MCMC.

For this, we have to consider that the expectation, \(E(X)\), of any Gamma distribution is equal to the shape parameter (\(\alpha\)) times by the scale parameter (\(\sigma\)). We could express the aforementioned the following.

\(\begin{aligned} E(X) &= \alpha \sigma \\ \text{with}~X &\sim \text{Gamma}(\alpha, \sigma) \end{aligned}\)

ggplot(mcmc, aes(x = step, y = cumsum(value)/step)) +
  geom_line(col = "#374E55FF") +
  scale_x_continuous(expand = c(0,.1), name = "Steps (log-scale)", 
                     transform = "log10", labels = scales::label_log()) +
  scale_y_continuous(name = NULL, expand = c(0, 1)) +
  labs(title = "Convergence to location parameter",
       subtitle = "Cumulative mean across steps") +
  geom_hline(aes(yintercept = 2), col = "darkred") +
  geom_hline(aes(yintercept = mean(value)), lty = 2) +
  annotate(x = 1.5, xend = 1.1, y = 7.5, yend = 9.5, geom = "curve", curvature = -.2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = 2, y = 6.8,  label = "Initial value", size = 5, geom = "text") +
  annotate(x = (10^2.5), xend = (10^2.6), y = 5, yend = 2.5, geom = "curve", curvature = .2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = (10^2.5), y = 5.8,  label = "Convergence", size = 5, geom = "text") +
  theme_classic(base_size = 20)

Cumulative mean of the posterior distributions across steps, compared to the empirical mean of the target distribution. Here the dark red line represents the empirical location parameter and the dashed line the one estimated using MCMC.

Practical Tips for the Real World

Implementing MCMC Algorithms in Practice

Alright, theory’s cool and all, but let’s get down to brass tacks. When you’re rolling up your sleeves to implement MCMC algorithms, it’s like picking the right tool for the job. Simple models? Metropolis-Hastings or Gibbs sampling has your back. But when you’re wrangling with the big boys — those complex models — that’s when you call in Hamiltonian Monte Carlo. It’s like upgrading from a rusty old wrench to a shiny new power tool. And don’t forget about tuning those parameters — it’s like fine-tuning your car for a smooth ride.

Beyond all the technical jargon, successful Bayesian inference is part gut feeling, part detective work. Picking the right priors is like seasoning a dish — you want just the right flavor without overpowering everything else. And tuning those parameters? It’s like fine-tuning your favorite instrument to make sure the music hits all the right notes.

Challenges and What Lies Ahead

But hey, nothing worth doing is ever a walk in the park, right? MCMC might be the hero of the Bayesian world, but it’s not without its challenges. Scaling up to big data? It’s like trying to squeeze into those skinny jeans from high school — uncomfortable and a bit awkward. And exploring those complex parameter spaces? It’s like navigating a maze blindfolded.

But fear not! There’s always a light at the end of the tunnel. Recent innovations in Bayesian inference, like variational inference (we’ll tackle this cousin of MCMC in the next posts) and probabilistic programming languages (like Stan), are like shiny beacons, guiding us to new horizons.

These days, some probabilistic programming languages, like Stan, use a souped-up version of the Hamiltonian Monte Carlo algorithm with hyperparameters tuned on the fly. These tools are like magic wands that turn your ideas into reality, just specify your model, and let the parameter space exploration happen in the background, no sweat.

Wrapping It Up

As we wrap up our journey into the world of MCMC, let’s take a moment to appreciate the wild ride we’ve been on. MCMC might not wear a cape, but it’s the unsung hero behind so much of what we do in Bayesian data analysis. It’s the tool that lets us dive headfirst into the murky waters of uncertainty and come out the other side with clarity and insight.

In the next installment of our MCMC series, we’ll dive into the infamous Hamiltonian Monte Carlo and unravel the statistical wizardry behind this and other similar algorithms. Until then, happy sampling!


BibTeX citation:
  author = {Castillo-Aguilar, Matías},
  title = {Markov {Chain} {Monte} {What?}},
  date = {2024-04-25},
  url = { mcmc part 1},
  doi = {10.59350/mxfyk-6av39},
  langid = {en}
For attribution, please cite this work as:
Castillo-Aguilar, Matías. 2024. “Markov Chain Monte What?” April 25, 2024.