Markov Chain Monte What?

In this post we will delve into the main idea behind Markov Chain Monte Carlo (MCMC for short) and why it is useful within the Bayesian inference framework.


Introduction

Alright, folks, let’s dive into the wild world of statistics and data science! Picture this: you’re knee-deep in data, trying to make sense of the chaos. But here’s the kicker: sometimes the chaos is just too darn complex. With tons of variables flying around, getting a grip on uncertainty can feel like trying to catch smoke with your bare hands.

Keep in mind that the kind of problem we’re dealing with isn’t solely about the number of dimensions; it’s mostly about trying to estimate something that we can’t see in full beforehand. For instance, consider the banana distribution shown below. How could we map this simple two-dimensional surface without computing it all at once?

Code
library(plotly) # Needed for plot_ly()

# Banana-shaped curve used to build the illustrative surface
dbanana <- function(x) {
  a <- 2
  b <- 0.2
  
  y <- x / a # Rescale x
  (a * b) * (y^2 + a^2) # Quadratic "banana" curve
}

x <- seq(-6, 6, length.out = 300)

y <- dbanana(x)

# 2D kernel density estimate over the (x, y) curve
z <- MASS::kde2d(x, y, n = 100, lims = c(-10, 10, -2.6, 20))

plot_ly(x = z$x, y = z$y, z = sqrt(z$z)) |> 
  add_surface() |> 
  style(hoverinfo = "none")

Just put a darn grid on it

You know when you hit a roadblock in your calculations, and you’re like, “Can’t we just crunch the numbers for every single value?” Well, let’s break it down. Picture a grid with \(N\) points per dimension across \(D\) dimensions. Now, brace yourself, ’cause the number of computations needed is \(N\) raised to the power of \(D\).

So, let’s say you wanna estimate 100 points (to get a decent estimation of the shape) for each of 100 dimensions. That’s \(100^{100} = 10^{200}\) computations… that’s a hell of a lot of computations!

Sure, in la-la land, you could compute every single value to some degree of approximation. But let’s get real here: even if you had all the time in the world, you’d still be chipping away at those calculations until the sun swallowed the Earth, especially with continuous variables and tons of dimensions that are somewhat correlated (which, in reality, tends to be the case).

This headache we’re dealing with? It’s what we “affectionately” call — emphasis on double quotes — the curse of dimensionality. It’s like trying to squeeze a square peg into a round hole… it ain’t gonna happen without a supersized hammer!

Code
library(ggplot2) # Needed for the plots throughout this post

# Computations needed for a grid of 100 points per dimension
curse_dimensionality <- data.frame(dimensions = factor((1:10)^2),
                                   calculations = 100^((1:10)^2))

ggplot(curse_dimensionality, aes(dimensions, calculations)) +
  geom_col(fill = ggsci::pal_jama()(1)) +
  scale_y_continuous(transform = "log10", n.breaks = 9,
                     labels = scales::label_log(), expand = c(0,0,.1,0)) +
  labs(y = "Computations (log-scale)", x = "Dimensions (Variables)",
       title = "Computations needed to compute a grid of 100 points",
       subtitle = "As a function of dimensions/variables involved") +
  theme_classic(base_size = 20)

Illustration of computations needed (in log-scale) for 100 points as a function of dimensions considered.

Imagine you’re trying to create a grid to map out the probability space for a set of variables. As the number of dimensions increases, the number of grid points needed to adequately represent the space explodes exponentially. This means that even with the most powerful computers, it becomes practically impossible to compute all the probabilities accurately.
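To make this concrete, here’s a minimal sketch of how quickly the grid blows up, assuming 100 grid points per dimension:

n_points <- 100 # Grid points per dimension
dims <- c(1, 2, 3, 5, 10) # Number of dimensions

# Total grid points needed grows as n_points^dims
data.frame(dimensions = dims, grid_points = n_points^dims)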

Sampling the unknown: Markov Chain Monte Carlo

Now, if we can’t crack the problem analytically (which, let’s face it, is the case most of the time), we gotta get creative. Lucky for us, there’s a bunch of algorithms that can lend a hand by sampling this high-dimensional parameter space. Enter the Markov Chain Monte Carlo (MCMC) family of algorithms.

But hold up — Markov Chain Monte What? Yeah, it’s a mouthful, but bear with me. You’re probably wondering how this fancy-schmancy term is connected to exploring high-dimensional probability spaces. Well, I’ll let you in on the secret sauce behind these concepts and why they’re the go-to tools in top-notch probabilistic software like Stan.

But before we get into the nitty-gritty of MCMC, let’s take a detour and talk about Markov Chains, because they’re like the OGs of this whole MCMC gang.

Understanding Markov Chains: A rainy example

Consider the following scenario: if today is rainy, the probability that tomorrow will be rainy again is 60%, but if today is sunny, the probability that tomorrow will be rainy is only 30%. Conversely, the probability of tomorrow being sunny is 40% if today is rainy, but 70% if today is sunny.

flowchart LR
  a((Rainy)) --->| 40% | b((Sunny))
  a -->| 60% | a
  b -->| 70% | b
  b --->| 30% | a

As you can see, the probability of a future step depends only on the current step. This logic is central to MCMC sampling within Bayesian inference, as it allows us to talk about the conditional probability of a future value based on the previous one, like sampling across a continuous parameter space.

Converging to an answer

Now, fast forward a year. If we keep an eye on the weather every day, we’ll notice something interesting: the relative frequencies of rainy and sunny days start to settle into a rhythm. This steady state is what we call a stationary distribution. It’s like the true probability of what the weather’s gonna be like in the long run, taking into account all the different scenarios.

Code
simulate_weather <- function(total_time) {
  
  weather <- vector("character", total_time) # Create slots for each day
  day <- 1 # First day
  weather[day] <- sample(c("Rainy", "Sunny"), size = 1) # Weather for first day
  
  while (day < total_time) {
    day <- day + 1 # Add one more day
    # Tomorrow's weather depends on *yesterday's* weather
    if (weather[day - 1] == "Rainy") {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.6, .4))
    } else {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.3, .7))
    }
  }
  
  return(weather)
}

sim_time <- 365*1
weather <- simulate_weather(total_time = sim_time)

weather_data <- data.frame(
  prop = c(cumsum(weather == "Rainy") / seq_len(sim_time), cumsum(weather == "Sunny") / seq_len(sim_time)),
  time = c(seq_len(sim_time), seq_len(sim_time)),
  weather = c(rep("Rainy", times = sim_time), rep("Sunny", times = sim_time))
)

ggplot(weather_data, aes(time, prop, fill = weather)) +
  geom_area() +
  scale_y_continuous(labels = scales::label_percent(), n.breaks = 6,
                     name = "Proportion of each weather", expand = c(0,0)) +
  scale_x_continuous(name = "Days", n.breaks = 10, expand = c(0,0)) +
  scale_fill_brewer(type = "qual", palette = 3) +
  labs(fill = "Weather", title = "Convergence to stationary distribution",
       subtitle = "Based on cumulative proportion of each Sunny or Rainy days") +
  theme_classic(base_size = 20)

Cumulative mean proportion of sunny/rainy days across 365 days. Right past the 100-day mark, the proportion of rainy/sunny days tends to display a stable trend when we average over the previous days. This is known as the stationary distribution.

This heuristic allows us to naturally converge to an answer without needing to solve it analytically, which tends to be useful for really complex and high-dimensional problems.

Sure, we could’ve crunched the numbers ourselves to figure out these probabilities. But why bother with all that math when we can let time do its thing and naturally converge to the same answer? Especially when we’re dealing with complex problems that could have given even Einstein himself a headache.
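For the curious, “crunching the numbers” here amounts to raising the transition matrix to a high power; a minimal sketch, using the transition probabilities from the diagram above:

# Transition matrix: rows are today's weather, columns are tomorrow's
P <- matrix(c(.6, .4,
              .3, .7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Rainy", "Sunny"), c("Rainy", "Sunny")))

# Multiplying P by itself many times makes every row converge to the
# stationary distribution (roughly 43% rainy, 57% sunny)
Reduce(`%*%`, rep(list(P), 50)) |> round(3)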

The idea of convergence to a stationary distribution can be likened to taking a random walk through the space of possible outcomes. Over time, the relative frequencies of each outcome stabilize, giving us a reliable estimate of the true probabilities.

As we’ve seen, sometimes it becomes impractical to solve analytically or even approximate the posterior distribution using a grid, given the number of calculations needed to even get a decent approximation of the posterior.

However, we’ve also seen that Markov Chains might offer us a way to compute complex conditional probabilities and, if we let them run long enough, they will eventually converge to the stationary distribution, which could resemble the posterior distribution itself. So, all things considered, when does the Monte Carlo part come in?

The Need for Monte Carlo Methods

Alright, let’s break down the magic of Monte Carlo methods in plain English. Picture this: in the wacky world of random events, being able to sample from a distribution is like having a crystal ball to predict the future — pretty nifty, right?

Now, imagine we’re sampling from a normal probability density of body weight, say, with a mean of 80 kg and a standard deviation of 5 kg. We grab a random sample of 10 folks, calculate their average weight, and repeat this process a thousand times.

In the following figure, we overlay the calculated sample means from each simulated sample (as a histogram) on the population distribution from which we are sampling. As you can see, this opens up an interesting opportunity: using this Monte Carlo simulation, we can get an intuition of how likely it is for our sample of 10 individuals to have a mean outside the range of 75 to 85 kg. It’s not impossible, but it’s unlikely.

Code
# 1000 simulated samples of 10 individuals each; colMeans() gives one mean per sample
mean_weights <- matrix(data = rnorm(10 * 1000, 80, 5), nrow = 10, ncol = 1000) |> 
  colMeans()

cols <- ggsci::pal_jama()(2)

ggplot() +
  stat_function(fun = ~dnorm(.x, 80, 5), xlim = c(60, 100), geom = "area", fill = cols[2]) +
  geom_histogram(aes(x = mean_weights, y = after_stat(density)/3.75),
               fill = cols[1], col = cols[2], binwidth = .4) +
  scale_y_continuous(expand = c(0,0), name = "Density", labels = NULL, breaks = NULL) +
  scale_x_continuous(expand = c(0,0), name = "Weight (kg)", n.breaks = 10) +
  geom_curve(data = data.frame(
    x = c(70), xend = c(74.5), y = c(0.061), yend = c(0.05)
    ), aes(x = x, xend = xend, y = y, yend = yend),
    curvature = -.2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
  geom_curve(data = data.frame(
    x = c(90), xend = c(82), y = c(0.061), yend = c(0.05)
    ), aes(x = x, xend = xend, y = y, yend = yend),
    curvature = .2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
  geom_text(aes(x = c(67, 94), y = c(0.0605), label = c("Population\ndistribution", "Means of each\nsimulated sample")), size = 6) +
  theme_classic(base_size = 20)

Sampling distribution of the mean of 10 individuals per sample, overlaid on the population distribution. Each sample is drawn from a normal distribution with mean 80 and standard deviation 5, representing the population distribution of weight.

With each sample, we’re capturing the randomness of the population’s weight distribution. And hey, it’s not just about weight; we can simulate all sorts of wild scenarios, from multi-variable mayhem to linear model lunacy. This is the heart and soul of Monte Carlo methods: taking random shots in the dark to mimic complex processes.

But here’s the kicker: the more samples we take, the clearer the picture becomes. For instance, if we took ten times as many samples, we would get a better intuition about the uncertainty around the expectation for each sample of 10 individuals, which has important applications in the design of experiments and hypothesis testing.
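In fact, we can put a number on the claim from before. Here’s a minimal sketch, with an arbitrary 10,000 simulations, estimating how often the mean of 10 individuals lands outside 75 to 85 kg:

# Simulate many sample means and count how often they fall outside 75-85 kg
n_sims <- 10000
sample_means <- replicate(n_sims, mean(rnorm(10, mean = 80, sd = 5)))
mean(sample_means < 75 | sample_means > 85)

# For reference, the analytic answer is 2 * pnorm(-5 / (5 / sqrt(10))),
# about 0.16%, so "unlikely" indeed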

And that’s where Monte Carlo methods shine. By generating a boatload of samples, we can unravel the mysteries of even the trickiest distributions, no crystal ball required. It’s a game changer for exploring the unknown without needing a PhD in rocket science.

Monte Carlo methods provide a powerful tool for approximating complex distributions by sampling from them. By generating a large number of samples, we can gain insight into the shape and properties of the distribution without needing to explicitly calculate all possible outcomes.

Basics of MCMC

Alright, let’s break down the basics of MCMC. Picture this: you’ve got these two heavyweights in the world of statistics, Markov Chains and Monte Carlo methods.

On one side, you’ve got Markov Chains. These bad boys help us predict the probability of something happening based on what happened before. It’s like saying, “Hey, if it rained yesterday, what’s the chance it’ll rain again today?”

Then, there are Monte Carlo methods. These puppies work by randomly sampling from a distribution to get an idea of what the whole shebang looks like. It’s like throwing a bunch of darts at a dartboard in the dark and hoping you hit the bullseye.

However, the question remains: how do they team up to tackle real-world problems?

What is MCMC actually doing?

In essence, MCMC is an algorithm that generates random samples from a proposal distribution. These samples are accepted or rejected based on how much more likely the proposed sample is compared to the previously accepted sample.

In this way, proposed samples are accepted in proportion to their actual probability under the target distribution: more likely values are accepted more often, and less likely values less often.

The fascinating nature of this heuristic is that it works to approximate complex distributions without needing to know much about the shape of the final distribution.

So, think of it as trekking through this complex landscape, taking random steps (the Monte Carlo part) but guided by the likelihood of each move, given where you currently stand (the Markov Chain part). It’s a meticulous journey, but one that ultimately leads us to a better understanding of these elusive distributions.

For instance, consider that we have a distribution (shown below) that we can’t compute in full, because it would take too long to integrate the whole function. This will be our target distribution, from which we can only compute the density of one value at a time.

In practice, we would derive the target distribution from the data and prior information; this enables us to estimate the density in a point-wise manner, without the need to estimate the whole PDF all at once. But for the sake of demonstration, we will use the Gamma probability density function.
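Before we do, here’s a purely hypothetical sketch of what such a point-wise target might look like when derived from data and a prior (the observations `obs`, the prior, and the likelihood below are all made up for illustration):

# Unnormalized posterior: prior times likelihood, evaluated one point at a time
obs <- c(1.2, 2.5, 0.8) # Hypothetical observations

unnorm_posterior <- function(mu) {
  prior <- dnorm(mu, mean = 1, sd = 2) # Prior belief about mu
  likelihood <- prod(dnorm(obs, mean = mu, sd = 1)) # Likelihood of the data
  prior * likelihood
}

unnorm_posterior(1.5) # Density (up to a constant) at a single point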

However, please consider that no single family of distributions can perfectly describe any probability density; sometimes the target involves a mixture of distributions, truncation, or censoring. It all comes down to the underlying process that generates the data we are trying to mimic.

Code
# Target distribution that we in practice would derive from
# the data.
target_dist <- function(i) dgamma(i, shape = 2, scale = 1)

ggplot() +
  stat_function(fun = target_dist,
                xlim = c(0, 11), geom = "area", 
                fill = "#374E55FF") +
  scale_y_continuous(breaks = NULL, name = "Density", expand = c(0,0)) +
  scale_x_continuous(name = "Some scale", expand = c(0,0)) +
  theme_classic(base_size = 20)

This is a Gamma distribution with shape of 2 and scale of 1. We will try to estimate it.

The next thing to do is to specify a proposal distribution, from which we’ll generate proposals for the next step. To this end, we’ll be using a Normal density function with \(\mu\) = 0 and \(\sigma\) = 1.

# This function generates a proposal for the next step.
proposal <- function() rnorm(1, mean = 0, sd = 1)

And set some algorithm parameters that are necessary for our MCMC to run:

## Algorithm parameters ----

total_steps <- 1000 # Total number of steps
step <- 1 # We start at step 1
value <- 10 # Set an initial starting value

Finally, we run our algorithm as explained in the previous sections. Try to follow the code to get an intuition of what it’s doing.

## Algorithm ----

set.seed(1234) # Seed for reproducibility
while(step < total_steps) {
  # Increase for next step
  step <- step + 1
  
  ## 1. Propose a new value ----
  
  # Proposal of the next step is ...
  value[step] <- 
    # the previous step plus...
    value[step - 1L] + 
    # a change in a random direction (drawn from the 
    # proposal distribution)
    proposal() 
  
  ## 2. We see if the new value is more or less likely ----
  
  # Likelihood ratio: how likely (under the target distribution)
  likelihood <- 
    # is the proposed value compared to the previous step
    target_dist(value[step]) / target_dist(value[step - 1L]) 
  
  ## 3. Based on its likelihood, we accept or reject it ----
  
  # If the proposed value is less likely, we accept it only with
  # probability equal to the likelihood ratio; otherwise we keep
  # the previous value
  if (likelihood < runif(1)) 
    value[step] <- value[step - 1L]
  
  # Then we repeat for the next step
}

Finally, let’s explore how well our algorithm converges to the target distribution.

Code
mcmc <- data.frame(
  step = seq_len(step),
  value = value
)

ggplot(mcmc, aes(x = step, y = value)) +
  geom_line(col = "#374E55FF") +
  ggside::geom_ysidehistogram(aes(x = -after_stat(count)), fill = "#374E55FF", binwidth = .3) +
  ggside::geom_ysidedensity(aes(x = -after_stat(count)*.35), col = "#374E55FF") +
  ggside::scale_ysidex_continuous(expand = c(0,0,0,.1), breaks = NULL) +
  scale_x_continuous(expand = c(0,0), name = "Step") +
  scale_y_continuous(name = NULL, position = "right") +
  labs(title = "Trace of MCMC values to target distribution",
       subtitle = "Evolution of values at each step") +
  theme_classic(base_size = 20) +
  ggside::ggside(y.pos = "left") +
  theme(ggside.panel.scale = .4)

Traceplot of convergence of MCMC for 1000 steps. With increasing steps we see an increasing resemblance to the target distribution.

Another thing we care about is how well our MCMC is performing. After all, if it performs poorly, what would be the point of using it in the first place? To check this, we’ll compare the expectation (\(E(X)\)) of the target distribution against the one derived from our MCMC samples.

For this, we have to consider that the expectation, \(E(X)\), of any Gamma distribution is equal to the shape parameter (\(\alpha\)) times the scale parameter (\(\sigma\)). We can express this as follows.

\(\begin{aligned} E(X) &= \alpha \sigma \\ \text{with}~X &\sim \text{Gamma}(\alpha, \sigma) \end{aligned}\)
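Since our target has shape \(\alpha = 2\) and scale \(\sigma = 1\), the true expectation is 2. A quick sanity check on our chain, then, is simply:

# The true expectation is alpha * sigma = 2 * 1 = 2
mean(value) # MCMC estimate of E(X); it should land close to 2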

Code
ggplot(mcmc, aes(x = step, y = cumsum(value)/step)) +
  geom_line(col = "#374E55FF") +
  scale_x_continuous(expand = c(0,.1), name = "Steps (log-scale)", 
                     transform = "log10", labels = scales::label_log()) +
  scale_y_continuous(name = NULL, expand = c(0, 1)) +
  labs(title = "Convergence to location parameter",
       subtitle = "Cumulative mean across steps") +
  geom_hline(aes(yintercept = 2), col = "darkred") +
  geom_hline(aes(yintercept = mean(value)), lty = 2) +
  annotate(x = 1.5, xend = 1.1, y = 7.5, yend = 9.5, geom = "curve", curvature = -.2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = 2, y = 6.8,  label = "Initial value", size = 5, geom = "text") +
  annotate(x = (10^2.5), xend = (10^2.6), y = 5, yend = 2.5, geom = "curve", curvature = .2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = (10^2.5), y = 5.8,  label = "Convergence", size = 5, geom = "text") +
  theme_classic(base_size = 20)

Cumulative mean of the MCMC samples across steps, compared to the true expectation of the target distribution. Here the dark red line represents the true expectation and the dashed line the one estimated using MCMC.

Practical Tips for the Real World

Implementing MCMC Algorithms in Practice

Alright, theory’s cool and all, but let’s get down to brass tacks. When you’re rolling up your sleeves to implement MCMC algorithms, it’s like picking the right tool for the job. Simple models? Metropolis-Hastings or Gibbs sampling has your back. But when you’re wrangling with the big boys — those complex models — that’s when you call in Hamiltonian Monte Carlo. It’s like upgrading from a rusty old wrench to a shiny new power tool. And don’t forget about tuning those parameters — it’s like fine-tuning your car for a smooth ride.
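One concrete knob is the spread of the proposal distribution. As a minimal, hedged diagnostic on the chain we built earlier (the thresholds below are rough rules of thumb, not hard cutoffs):

# Fraction of steps where the proposal was accepted; rejected steps
# repeat the previous value, so consecutive values are identical
acceptance_rate <- mean(diff(value) != 0)
acceptance_rate

# For random-walk Metropolis, acceptance rates of roughly 20-50% are a
# common rule of thumb; values far outside that range suggest the
# proposal sd needs tuning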

Beyond all the technical jargon, successful Bayesian inference is part gut feeling, part detective work. Picking the right priors is like seasoning a dish — you want just the right flavor without overpowering everything else. And tuning those parameters? It’s like fine-tuning your favorite instrument to make sure the music hits all the right notes.

Challenges and What Lies Ahead

But hey, nothing worth doing is ever a walk in the park, right? MCMC might be the hero of the Bayesian world, but it’s not without its challenges. Scaling up to big data? It’s like trying to squeeze into those skinny jeans from high school — uncomfortable and a bit awkward. And exploring those complex parameter spaces? It’s like navigating a maze blindfolded.

But fear not! There’s always a light at the end of the tunnel. Recent innovations in Bayesian inference, like variational inference (we’ll tackle this cousin of MCMC in the next posts) and probabilistic programming languages (like Stan), are like shiny beacons, guiding us to new horizons.

These days, some probabilistic programming languages, like Stan, use a souped-up version of the Hamiltonian Monte Carlo algorithm with hyperparameters tuned on the fly. These tools are like magic wands that turn your ideas into reality, just specify your model, and let the parameter space exploration happen in the background, no sweat.

Wrapping It Up

As we wrap up our journey into the world of MCMC, let’s take a moment to appreciate the wild ride we’ve been on. MCMC might not wear a cape, but it’s the unsung hero behind so much of what we do in Bayesian data analysis. It’s the tool that lets us dive headfirst into the murky waters of uncertainty and come out the other side with clarity and insight.

In the next installment of our MCMC series, we’ll dive into the infamous Hamiltonian Monte Carlo and unravel the statistical wizardry behind this and other similar algorithms. Until then, happy sampling!

Citation

BibTeX citation:
@misc{castillo-aguilar2024,
  author = {Castillo-Aguilar, Matías},
  title = {Markov {Chain} {Monte} {What?}},
  date = {2024-04-25},
  url = {https://bayesically-speaking.com//posts/2024-04-14 mcmc part 1},
  doi = {10.59350/mxfyk-6av39},
  langid = {en}
}
For attribution, please cite this work as:
Castillo-Aguilar, Matías. 2024. “Markov Chain Monte What?” April 25, 2024. https://doi.org/10.59350/mxfyk-6av39.