Bayesically Speaking
https://bayesically-speaking.com/blog.html
The place where statistics, coffee and Bayes’ theorem come together
Wed, 15 May 2024 03:00:00 GMT
The Good, The Bad, and Hamiltonian Monte Carlo
Matías Castillo-Aguilar
https://bayesically-speaking.com/posts/2024-05-15 mcmc part 2/

Introduction

Hey there, fellow science enthusiasts and stats geeks! Welcome back to the wild world of Markov Chain Monte Carlo (MCMC) algorithms. This is part two of my series on the powerhouse behind Bayesian inference. If you missed the first post, no worries! Just hop on over to part one (https://bayesically-speaking.com/posts/2024-04-14 mcmc part 1/) and catch up before we dive deeper into the MCMC madness. Today, we’re exploring the notorious Hamiltonian Monte Carlo (HMC), a special kind of MCMC algorithm that taps into the dynamics of Hamiltonian mechanics.

Stats Meets Physics?

Hold up, did you say Hamiltonian mechanics? What in the world do mechanics and physics have to do with Bayesian stats? I get it, it sounds like a mashup of your wildest nightmares. But trust me, this algorithm sometimes feels like running a physics simulation in a statistical playground. Remember our chat from the last post? In Bayesian stats, we’re all about estimating the shape of a parameter space, aka the posterior distribution.

A Particle Rolling Through Stats Land

Picture this: You drop a tiny particle down a cliff, and it rolls naturally along the landscape’s curves and slopes. Easy, right? Now, swap out the real-world terrain for a funky high-dimensional probability function. That same little particle? It’s cruising through this wild statistical landscape like a boss, all thanks to the rules of Hamiltonian mechanics.

About the animation

The animation above illustrates the Hamiltonian dynamics of a particle traveling through a two-dimensional parameter space. The code for this animation is borrowed from Chi Feng’s GitHub. You can find the original repository with the corresponding code here: https://github.com/chi-feng/mcmc-demo

Hamiltonian Mechanics: A Child’s Play?

Let’s break down Hamiltonian dynamics in terms of position and momentum with a fun scenario: Imagine you’re on a swing. When you hit the highest point, you slow down, right? Your momentum’s almost zero. But here’s the kicker: You know you’re about to pick up speed on the way down, gaining momentum in the opposite direction. That moment when you’re at the top, almost motionless? That’s when you’re losing kinetic energy and gaining potential energy, thanks to gravity getting ready to pull you back down.

So, in this analogy, when your kinetic energy (think swing momentum) goes up, your potential energy (like being at the bottom of the swing) goes down. And vice versa! When your kinetic energy drops (like when you’re climbing back up), your potential energy shoots up, waiting for gravity to do its thing.

This energy dance is captured by the Hamiltonian ($H$), which sums up the total energy in the system. It’s the sum of kinetic energy ($K$) and potential energy ($U$):

$$H(q, p) = K(p) + U(q)$$

At its core, Hamiltonian Monte Carlo (HMC) borrows from Hamiltonian dynamics, a fancy term for the rules that govern how physical systems evolve in phase space. In Hamiltonian mechanics, a system’s all about its position ($q$) and momentum ($p$), and their dance is choreographed by Hamilton’s equations. Brace yourself, things are about to get a little mathy:

$$\frac{dq}{dt} = \frac{\partial H}{\partial p}, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q}$$

Wrapping Our Heads Around the Math

Okay, I know Hamiltonian dynamics can be a real brain-buster. Trust me, it took me a hot minute to wrap my head around it. But hey, I’ve got an analogy that might just make it click. Let’s revisit our swing scenario with the kid on the swing. The swing’s angle from the vertical ($q$) tells us where the kid is, and momentum ($p$) is how fast the swing’s moving.

Now, let’s break down those equations:

$$\frac{dq}{dt} = \frac{\partial H}{\partial p}$$

This one’s like peeking into the future to see how the angle ($q$) changes over time. And guess what? It’s all about momentum ($p$). The faster the swing’s going, the quicker it swings back and forth. Simple as that!

Next up:

$$\frac{dp}{dt} = -\frac{\partial H}{\partial q}$$

Now, this beauty tells us how momentum ($p$) changes over time. It’s all about the energy game here: specifically, how the swing’s position ($q$) affects its momentum. When the swing’s at the highest point, gravity’s pulling hardest, ready to send the kid back down.

So, picture this:

The kid swings forward, so the angle ($q$) goes up thanks to the momentum ($p$) building until, bam, top of the swing.

At the top, the swing’s momentarily still, but gravity’s pulling to send him flying back down — hence, he is accumulating potential energy.

Zoom! Back down it goes, picking up speed in the opposite direction — and so, the potential energy is then transferred into kinetic energy.

All the while, the Hamiltonian ($H$) is keeping tabs on the swing’s total energy, whether it’s zooming at the bottom (high kinetic energy $K$, as a function of momentum $p$) or pausing at the top (high potential energy $U$, as a function of position $q$).

This dance between kinetic and potential energy is what we care about in Hamiltonian mechanics. It’s also what we mean when we refer to the phase space, which is nothing more than the relationship between position and momentum.

Visualizing Hamilton’s Equations

Okay, I know we’re diving into some physics territory here in a stats blog, but trust me, understanding these concepts is key to unlocking what HMC’s all about. So, let’s take a little side trip and get a feel for Hamilton’s equations with a different example. Check out the gif below — see that weight on a string? It’s doing this cool back-and-forth dance thanks to the tug-of-war between the string pulling up and gravity pulling down.

Now, let’s get a little hands-on with some code. We’re gonna simulate a simple harmonic oscillator — you know, like that weight on a string — and watch how it moves through phase space.

    # Define the potential energy function (U) and its derivative (dU/dq)
    U <- function(q) {
      k <- 1 # Spring constant
      return(0.5 * k * q^2)
    }

    dU_dq <- function(q) {
      k <- 1 # Spring constant
      return(k * q)
    }

    # Kinetic energy (K) used for later
    K <- function(p, m) {
      return(p^2 / (2 * m))
    }

    # Introduce a damping coefficient
    b <- 0.1 # Damping coefficient

    # Set up initial conditions
    q <- -3.0 # Initial position
    p <- 0.0 # Initial momentum
    m <- 1.0 # Mass

    # Time parameters
    t_max <- 20
    dt <- 0.1
    num_steps <- ceiling(t_max / dt) # Ensure num_steps is an integer

    # Initialize arrays to store position and momentum values over time
    q_values <- numeric(num_steps)
    p_values <- numeric(num_steps)

    # Perform time integration using the leapfrog method
    for (i in 1:num_steps) {
      # Store the current values
      q_values[i] <- q
      p_values[i] <- p

      # Half step update for momentum with damping
      p_half_step <- p - 0.5 * dt * (dU_dq(q) + b * p / m)

      # Full step update for position using the momentum from the half step
      q <- q + dt * (p_half_step / m)

      # Another half step update for momentum with damping using the new position
      p <- p_half_step - 0.5 * dt * (dU_dq(q) + b * p_half_step / m)
    }
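As a quick cross-check on the code above, here’s how Hamilton’s equations play out for this particular oscillator. This is a sketch that assumes, as the code does, a quadratic potential with spring constant $k$ and a kinetic energy for a mass $m$:

$$H(q, p) = K(p) + U(q) = \frac{p^2}{2m} + \frac{1}{2} k q^2$$

$$\frac{dq}{dt} = \frac{\partial H}{\partial p} = \frac{p}{m}, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial q} = -k q$$

The damping coefficient $b$ is not part of the Hamiltonian itself. It’s an extra friction term the code adds on top, so the momentum update it actually integrates is:

$$\frac{dp}{dt} = -k q - b \, \frac{p}{m}$$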

    # Packages used by the plotting code in this post
    library(data.table)
    library(ggplot2)

    harmonic_data <- data.table(`Position (q)` = q_values, `Momentum (p)` = p_values)

    ggplot(harmonic_data, aes(`Position (q)`, `Momentum (p)`)) +
      geom_path(linewidth = .7) +
      geom_point(size = 2) +
      geom_point(data = harmonic_data[1,], col = "red", size = 3) +
      labs(title = "Phase Space Trajectory of Hamiltonian Dynamics",
           subtitle = "Accounting for Energy Loss") +
      theme_classic(base_size = 20)

Now, take a look at that graphic. See how the position ($q$) is all about where the oscillator’s hanging out, and momentum ($p$)? Well, that’s just how fast the weight’s swinging. Put them together, and you’ve got what we call the phase space. Basically, it’s like peeking into the dance floor of these mechanical systems through the lens of Hamiltonian dynamics.

Now, in a perfect world, there’d be no energy lost over time. But hey, we like to keep it real, so we added a little something called a damping effect: think of it as energy leaking out of the system over time. In the real world, that makes sense, but in our statistical playground, we want to keep that energy locked in tight. After all, losing energy means we’re losing precious info about our target distribution, and nobody wants that.
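To see what “keeping the energy locked in” looks like, here’s a minimal sketch that reruns the same leapfrog loop with the damping switched off and compares how much the total energy moves in each case. The helper name `simulate_oscillator()` is made up for this illustration, and the defaults simply mirror the values used above.

```r
# Hypothetical helper: leapfrog simulation of the oscillator, with optional damping b
simulate_oscillator <- function(b = 0, q = -3, p = 0, m = 1, k = 1,
                                dt = 0.1, t_max = 20) {
  num_steps <- ceiling(t_max / dt)
  energy <- numeric(num_steps)
  for (i in seq_len(num_steps)) {
    # Total energy H = U(q) + K(p) at the current state
    energy[i] <- 0.5 * k * q^2 + p^2 / (2 * m)
    # Half step for momentum, full step for position, half step for momentum
    p_half <- p - 0.5 * dt * (k * q + b * p / m)
    q <- q + dt * p_half / m
    p <- p_half - 0.5 * dt * (k * q + b * p_half / m)
  }
  energy
}

energy_undamped <- simulate_oscillator(b = 0)
energy_damped   <- simulate_oscillator(b = 0.1)

# Spread of the total energy over the trajectory, relative to the starting energy
spread_undamped <- diff(range(energy_undamped)) / energy_undamped[1]
spread_damped   <- diff(range(energy_damped)) / energy_damped[1]
```

Without damping, the total energy should stay nearly flat (a small wobble from the discrete steps is expected), while the damped run loses most of its energy by the end of the simulation.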

    hamiltonian <- harmonic_data[, list(
      `Total energy` = U(`Position (q)`) + K(`Momentum (p)`, m),
      `Kinetic energy` = K(`Momentum (p)`, m),
      `Potential energy` = U(`Position (q)`)
    )]

    hamiltonian <- melt(hamiltonian, measure.vars = c("Total energy", "Kinetic energy", "Potential energy"))

    ggplot(hamiltonian, aes(rep(1:200, times = 3), value, col = variable)) +
      geom_line(linewidth = 1) +
      labs(y = "Energy", col = "Variable", x = "Time",
           title = "Fluctuation of the Total Energy in the Oscillator",
           subtitle = "As a Function of Kinetic and Potential Energy") +
      scale_color_brewer(type = "qual", palette = 2) +
      theme_classic(base_size = 20) +
      theme(legend.position = "top")

So, what’s the big takeaway here? Well, whether it’s a ball rolling down a hill or a sampler hunting for model coefficients, this framework’s got us covered. In Bayesian land, think of our model’s parameters as position coordinates ($q$) in some space, and of $p$ as the momentum helping us navigate the twists and turns of this parameter space. And with Hamiltonian dynamics leading the way, we’re guaranteed to find our path through this statistical dimension, one step at a time.

Building our own Hamiltonian Monte Carlo

Now that we’ve got some intuition about Hamiltonian mechanics, it’s time we build our own HMC sampler. To accomplish this, imagine that we’re diving into some data to figure out how an independent variable ($x$) and a dependent variable ($y$) are related. We’re talking about linear regression: you know, trying to draw a line through those scattered data points to make sense of it all. But hey, this is Bayesian territory, so we’re not just throwing any ol’ line on that plot. No, sir! We’re exploring the parameter space of those regression coefficients, that is, the slope and intercept. Then, when the data rolls in, we’re smashing together that likelihood and those priors using Bayes’ theorem to cook up a posterior distribution, a fancy way of saying our updated beliefs about those coefficients after we’ve seen the data.

Cooking Up Some Data

But before we dive into the statistical kitchen, let’s whip up some synthetic data. Picture this: we’re mimicking a real-world scenario where relationships between variables are as murky as a foggy morning. So, we’re gonna conjure up a batch of data with a simple linear relationship, jazzed up with a sprinkle of noise. Oh, and let’s keep it small — just 20 subjects, ’cause, hey, science “loves” a manageable sample size.

    set.seed(80) # Set seed for reproducibility

    # Define the number of data points and the range of independent variable 'x'
    n <- 20
    x <- seq(1, 10, length.out = n)

Okay, so now let’s get down to business and fit ourselves a nice, cozy linear relationship between an independent variable ($x$) and a dependent variable ($y$). We’re talking about laying down a straight line that best describes how $y$ changes with $x$.

So, what’s our equation look like? Well, it’s pretty simple:

$$y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_i$$

Hold on, let’s break it down. We’re saying that each value ($y_i$) is chillin’ around a mean ($\mu_i$), like a bunch of friends at a party. And guess what? They’re all acting like good ol’ normal folks, hanging out with a variance ($\sigma^2$) that tells us how spread out they are. Now, the cool part is how we define $\mu_i$. It’s just a simple sum of an intercept ($\beta_0$) and the slope ($\beta_1$) times $x_i$. Think of it like plotting points on graph paper: each $x_i$ tells us where we are on the $x$-axis, and multiplying by $\beta_1$ (then adding $\beta_0$) gives us the corresponding height on the line.

Now, let’s talk numbers. We’re setting the intercept ($\beta_0$) to 2 because, hey, every relationship needs a starting point, right? And for the slope ($\beta_1$), we’re going with 3: that’s the rate of change we’re expecting for every unit increase in $x$. Oh, and let’s not forget about $\sigma$, which is just a fancy way of saying how much our $y$ values are allowed to wiggle around.

    # Define the true parameters of the linear model and the noise level
    true_intercept <- 2
    true_slope <- 3
    sigma <- 5

With our model all set up, it’s time to create some data points for our $y$ variable. We’ll do this using the rnorm() function, which is like a magical data generator for normally distributed variables.

    # Generate the dependent variable 'y' with noise
    mu_i <- true_intercept + true_slope * x
    y <- rnorm(n, mu_i, sigma)

Choosing a Target Distribution

Alright, now that we’ve got our hands on some data, it’s time to dive into the nitty-gritty of Bayesian stuff. First up, we’re gonna need our trusty log likelihood function for our linear model. This function’s like the Sherlock Holmes of statistics — it figures out the probability of seeing our data given a specific set of parameters (you know, the intercept and slope we’re trying to estimate).

    # Define the log likelihood function for linear regression
    log_likelihood <- function(intercept, slope, x, y, sigma) {
      # We estimate the predicted response
      y_pred <- intercept + slope * x

      # Then we see how far from the observed value we are
      residuals <- y - y_pred

      # Then we estimate the likelihood associated with that error from a distribution
      # with no error (mean = 0)
      # (this is the function that we are trying to maximize)
      log_likelihood <- sum(dnorm(residuals, mean = 0, sd = sigma, log = TRUE))

      return(log_likelihood)
    }

So, what’s the deal with priors? Well, think of them as the background music to our data party. They’re like our initial hunches about what the parameters could be before we’ve even glanced at the data. To keep things simple, we’ll go with flat priors — no favoritism towards any particular values. It’s like saying, “Hey, let’s give everyone a fair shot!”

    # Define the log prior function for the parameters
    log_prior <- function(intercept, slope) {
      # Assuming flat priors for simplicity
      # (the log of 1 is 0, so it has no effect)
      return(0)
    }

Now, here’s where the real magic kicks in. We bring our likelihood and priors together in a beautiful dance to reveal the superstar of Bayesian statistics: the posterior distribution! This bad boy tells us everything we wanna know after we’ve taken the data into account. It allows us to combine previous knowledge, like past research, with the observed data, let’s say, samples from a new experiment.

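The posterior is nothing more than the probability associated with our parameters of interest (a.k.a. the intercept $\beta_0$ and the slope $\beta_1$), given the observed data. We represent this posterior distribution as the following:

$$P(\beta_0, \beta_1 \mid x, y) \propto P(y \mid x, \beta_0, \beta_1) \, P(\beta_0, \beta_1)$$

Which is the same as saying:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

On the log scale that product turns into a sum, which is exactly what the code computes.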

    # Combine log likelihood and log prior to get log posterior
    log_posterior <- function(intercept, slope, x, y, sigma) {
      return(log_likelihood(intercept, slope, x, y, sigma) + log_prior(intercept, slope))
    }
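One way to sanity-check a log posterior like this: with flat priors and a fixed $\sigma$, its maximum should land on the ordinary least-squares fit. Here’s a minimal sketch of that check. It regenerates the data with the same seed and re-declares a compact version of the function so the block runs on its own; reaching for `optim()` and `lm()` is just an illustration, not part of the sampler.

```r
set.seed(80)
n <- 20
x <- seq(1, 10, length.out = n)
sigma <- 5
y <- rnorm(n, 2 + 3 * x, sigma)

# Log posterior under flat priors: just the Gaussian log likelihood
log_posterior <- function(intercept, slope, x, y, sigma) {
  sum(dnorm(y - (intercept + slope * x), mean = 0, sd = sigma, log = TRUE))
}

# optim() minimizes, so we hand it the negative log posterior
fit <- optim(
  par = c(0, 0),
  fn = function(par) -log_posterior(par[1], par[2], x, y, sigma),
  method = "BFGS",
  control = list(reltol = 1e-12)
)

posterior_mode <- fit$par
ols_estimate   <- unname(coef(lm(y ~ x)))
```

If the two vectors agree (up to optimizer tolerance), the log posterior is wired up the way we expect, and we also know roughly where the sampler should end up spending its time.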

Building the HMC sampler

Alright, now that we have everything ready, let’s dig into the guts of HMC and how we put it to work. Remember how HMC takes inspiration from the momentum and potential energy dance in physics? Well, in practice, it’s like having a GPS for our parameter space, guiding us to new spots that are more likely than others.

But here’s the thing: our parameter space isn’t some smooth highway we can cruise along endlessly. Nope, it’s more like a rugged terrain full of twists and turns. So, how do we navigate this space? Enter the leapfrog integration method, the backbone of HMC.

Leapfrog Integration

So, leapfrog integration is basically this cool math trick we use in HMC to play out how a system moves over time using discrete steps, so we don’t have to compute every single value along the way. This integration method also has the advantage of keeping energy from leaking out of the system (remember the Hamiltonian?), which is super important if you don’t want to get stuck in this statistical dimension mid-exploration.

Here’s how it works in HMC-lingo: we use leapfrog integration to move around the parameter space in discrete steps, rather than sliding through a continuum, grabbing samples from the posterior distribution along the way. The whole process goes like this:

We give the momentum a little nudge by leveraging the gradient info ($\nabla U(q)$). The gradient, or slope, at the current position ($q$) in this parameter space determines how much our momentum will change. It’s like being at the top of the swing, where the potential energy is about to be transferred into kinetic energy (a.k.a. momentum).

We adjust the position (or parameters) based on the momentum boost.

Then we update the momentum based on the gradient at that new position.

We repeat the cycle for as many “jumps” as we are making, for each sample of the posterior we intend to draw.

Picture a frog hopping from one lily pad to another. That’s where the name “leapfrog” comes from. It helps us explore new spots in the parameter space by discretizing the motion of this imaginary particle under Hamiltonian dynamics, using the slope information ($\nabla U(q)$) at the current position to gain or lose momentum and move to another position in the parameter space.

We prefer leapfrogging over simpler methods like Euler’s method because it keeps errors low, both locally and globally. This stability is key, especially when we’re dealing with big, complicated systems. Plus, it’s a champ at handling high-dimensional spaces, where keeping energy in check is a must for the algorithm to converge.
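To make the Euler comparison concrete, here’s a small sketch that runs both integrators on the frictionless oscillator from earlier (unit mass, unit spring constant) and looks at the total energy at the end. The function names are made up for this example.

```r
# Total energy with U(q) = q^2 / 2 and K(p) = p^2 / 2
total_energy <- function(q, p) 0.5 * q^2 + 0.5 * p^2

euler_step <- function(q, p, dt) {
  # Both updates use the state at the start of the step
  c(q + dt * p, p - dt * q)
}

leapfrog_step <- function(q, p, dt) {
  p <- p - 0.5 * dt * q # Half step for momentum
  q <- q + dt * p       # Full step for position
  p <- p - 0.5 * dt * q # Half step for momentum at the new position
  c(q, p)
}

# Run an integrator from (q, p) = (-3, 0) and return the final total energy
run_integrator <- function(step_fun, n_steps = 200, dt = 0.1) {
  state <- c(-3, 0)
  for (i in seq_len(n_steps)) state <- step_fun(state[1], state[2], dt)
  total_energy(state[1], state[2])
}

initial_energy  <- total_energy(-3, 0)
euler_energy    <- run_integrator(euler_step)
leapfrog_energy <- run_integrator(leapfrog_step)
```

For this system, each Euler step multiplies the total energy by $(1 + dt^2)$, so the error compounds and the particle spirals outward. The leapfrog energy, by contrast, should stay close to where it started.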

Tuning Those Hamiltonian Gears

Now, to get our HMC sampler purring like a kitten, we’ve got to fine-tune a few gears. Think of these as the knobs and dials on your favorite sound system – adjust them just right, and you’ll be grooving to the perfect beat.

First up, we’ve got the number of samples. This determines how many times our sampler will take a peek at the parameter space before calling it a day.

Next, we’ve got the step size ($\epsilon$). Imagine this as the stride length for our leapfrog integrator. Too short, and we’ll be tiptoeing; too long, and we’ll be taking giant leaps. Neither gets us where we want to go. It’s all about finding that sweet spot.

Then, there’s the number of steps for the leapfrog to make. Too few, and we risk missing key spots; too many, and we might tire ourselves out.

Lastly, we need an initial guess for the intercept and slope. This is like dropping a pin on a map – it gives our sampler a starting point to begin its journey through the parameter space.

    # Initialization of the sampler
    num_samples <- 5000 # Number of samples
    epsilon <- 0.05 # Leapfrog step size
    num_steps <- 50 # Number of leapfrog steps
    init_intercept <- 0 # Initial guess for intercept
    init_slope <- 0 # Initial guess for slope

    # Placeholder for storing samples
    params_samples <- matrix(NA, nrow = num_samples, ncol = 2)

Starting the Sampler

Alright, let’s fire up this Hamiltonian engine and get this party started. Here’s the game plan:

Create a Loop: We’ll set up a loop to simulate our little particle moving around the parameter space.

Integrate Its Motion: Using our trusty leapfrog integrator, we’ll keep track of how our particle moves.

Grab a Sample: Once our particle has finished its dance, we’ll grab a sample at its final position.

Accept or Reject: We’ll play a little game of accept or reject – if the new position looks promising, we’ll keep it; if not, we’ll stick with the old one. It’s like Tinder for parameters.

Repeat: We’ll rinse and repeat this process until we’ve collected as many samples as we need.

Now, to kick things off, we’ll give our imaginary particle a random speed and direction to start with, and drop it down somewhere in the parameter space. This initial kick sets the stage for our parameter exploration, the rest is up to the physics.

    for (i in 1:num_samples) {
      # Start with a random momentum of the particle
      momentum_current <- rnorm(2)

      # And set the initial position of the particle
      params_proposed <- c(init_intercept, init_slope)

      # Next, we will simulate the particle's motion using
      # leapfrog integration.
      ...

Simulating the Particle’s Motion

Now that our imaginary particle is all geared up with an initial momentum and position, it’s time to let it loose and see where it goes in this parameter playground. We know we’re using Hamiltonian mechanics, but to make it computationally feasible, we’re bringing in our trusty leapfrog integrator. We’ve already seen that this bad boy discretizes the motion of our particle, making it manageable to track its journey without breaking our computers.

So, here’s the lowdown on what our leapfrog integrator is up to:

Estimating the Slope: We start off with an initial position and estimate the slope of the terrain at that point.

Adjusting Momentum: This slope is like the potential energy, dictating whether our particle speeds up or slows down. So, we tweak the momentum accordingly based on this slope.

Taking a Step: With this momentum tweak, we move the particle using a set step size ($\epsilon$), updating its position from the starting point.

Repeat: Now that our particle has a new spot, we rinse and repeat – estimating the slope and adjusting momentum.

Keep Going: We keep this cycle going for as many steps as we want to simulate, tracking our particle’s journey through the parameter space.

    for (j in 1:num_steps) {
      # Gradient estimation
      grad <- c(
        # Gradient of the intercept of X: -sum(residuals / variance)
        ...
        # Gradient of the slope of X: -sum(residuals * X / variance)
        ...
      )

      # Momentum half update
      momentum_current <- momentum_current - epsilon * grad * 0.5

      # Full step update for parameters
      params_proposed <- params_proposed + epsilon * momentum_current

      # Recalculate gradient for another half step update for momentum
      grad <- c(
        # Gradient of the intercept of X: -sum(residuals / variance)
        ...
        # Gradient of the slope of X: -sum(residuals * X / variance)
        ...
      )

      # Final half momentum update
      momentum_current <- momentum_current - epsilon * grad * 0.5
    }
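The two gradient expressions in the comments above can be checked numerically. The sketch below assumes the flat-prior linear model from this post, writes the potential energy as the negative log posterior, and compares the analytical gradient against central finite differences. The helper names `U()`, `grad_U()` and `approx_grad_U()` are made up for this illustration.

```r
# Potential energy: negative log posterior of the flat-prior linear model
U <- function(params, x, y, sigma) {
  -sum(dnorm(y - (params[1] + params[2] * x), mean = 0, sd = sigma, log = TRUE))
}

# Analytical gradient of U with respect to (intercept, slope)
grad_U <- function(params, x, y, sigma) {
  residuals <- y - (params[1] + params[2] * x)
  c(-sum(residuals / sigma^2), -sum(residuals * x / sigma^2))
}

# Central finite differences as an independent check
approx_grad_U <- function(params, x, y, sigma, h = 1e-5) {
  sapply(seq_along(params), function(k) {
    step <- replace(numeric(length(params)), k, h)
    (U(params + step, x, y, sigma) - U(params - step, x, y, sigma)) / (2 * h)
  })
}

set.seed(80)
x <- seq(1, 10, length.out = 20)
y <- rnorm(20, 2 + 3 * x, 5)

analytic_grad <- grad_U(c(0, 0), x, y, sigma = 5)
numeric_check <- approx_grad_U(c(0, 0), x, y, sigma = 5)
```

A check like this is cheap insurance: a sign error in the gradient sends the particle uphill instead of downhill, and the sampler quietly falls apart.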

Sample Evaluation and Acceptance

After our particle has taken its fair share of steps through the parameter space, it’s decision time – should we accept or reject its proposed new position? We can’t just blindly accept every move it makes; we’ve got to be smart about it.

That’s where the Metropolis acceptance criterion comes into play. This handy rule determines whether a proposed new position is a good fit or not. The idea is to weigh the probability of the new position against the probability of the current one. If the new spot is at least as probable as the current one, we always move there. If it’s less probable, we still move there with a probability equal to the ratio of the two densities; otherwise, we stick with where we are. This ensures that our samples accurately reflect the shape of the distribution we’re exploring.

The formula for this acceptance probability ($\alpha$) when transitioning from the current position ($q$) to a proposed position ($q^*$) is straightforward:

$$\alpha = \min\left(1, \frac{P(q^*)}{P(q)}\right)$$

Here, $P(q^*)$ is the probability density of the proposed position $q^*$, and $P(q)$ is the probability density of the current position $q$. We’re essentially comparing the fitness of the proposed spot against where we’re currently at. If the proposed position offers a higher probability density, we’re more likely to accept it. This ensures that our samples accurately represent the target distribution.

However, when dealing with very small probability values, we might run into numerical underflow issues. That’s where using the log posterior probabilities comes in handy. By taking the logarithm of the probabilities, we convert the ratio into a difference, making it easier to manage. Here’s how the acceptance criterion looks with logarithms:

$$\alpha = \min\left(1, \exp\left[\log P(q^*) - \log P(q)\right]\right)$$

This formulation is equivalent to the previous one but helps us avoid numerical headaches, especially when working with complex or high-dimensional data. We’re still comparing the fitness of the proposed position with our current spot, just in a more log-friendly way.
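Here’s a tiny sketch of the numerical headache in question. The two log posterior values are made-up numbers, chosen only because they sit far enough out in the tails for the raw densities to underflow in double precision.

```r
# Two log posterior values far out in the tails (made-up numbers for illustration)
log_p_current  <- -1050
log_p_proposed <- -1051

# Naive ratio: both densities underflow to 0, so the ratio is 0 / 0
naive_ratio <- exp(log_p_proposed) / exp(log_p_current)

# Log-scale version: the difference is a perfectly ordinary number
alpha <- min(1, exp(log_p_proposed - log_p_current))
```

The naive ratio comes back as `NaN`, which would break the accept/reject comparison, while the log-scale version gives the intended acceptance probability of $e^{-1}$.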

Now that we’ve broken down each piece of our HMC puzzle, it’s time to put them all together and see how the full algorithm works.

    for (i in 1:num_samples) {
      # Randomly initialize momentum
      momentum_current <- rnorm(2)

      # Make a copy of the current parameters
      params_proposed <- c(init_intercept, init_slope)

      # Perform leapfrog integration
      for (j in 1:num_steps) {
        # Half step update for momentum
        grad <- c(
          -sum((y - (params_proposed[1] + params_proposed[2] * x)) / sigma^2),
          -sum((y - (params_proposed[1] + params_proposed[2] * x)) * x / sigma^2)
        )
        momentum_current <- momentum_current - epsilon * grad * 0.5

        # Full step update for parameters
        params_proposed <- params_proposed + epsilon * momentum_current

        # Recalculate gradient for another half step update for momentum
        grad <- c(
          -sum((y - (params_proposed[1] + params_proposed[2] * x)) / sigma^2),
          -sum((y - (params_proposed[1] + params_proposed[2] * x)) * x / sigma^2)
        )
        momentum_current <- momentum_current - epsilon * grad * 0.5
      }

      # Calculate the log posterior of the current and proposed parameters
      log_posterior_current <- log_posterior(init_intercept, init_slope, x, y, sigma)
      log_posterior_proposed <- log_posterior(params_proposed[1], params_proposed[2], x, y, sigma)

      # Calculate the acceptance probability
      alpha <- min(1, exp(log_posterior_proposed - log_posterior_current))

      # Accept or reject the proposal
      if (runif(1) < alpha) {
        init_intercept <- params_proposed[1]
        init_slope <- params_proposed[2]
      }

      # Save the sample
      params_samples[i, ] <- c(init_intercept, init_slope)
    }
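The loop above accepts or rejects based on the log posterior alone. In the textbook formulation of HMC, the accept/reject step compares the full Hamiltonian instead, so the kinetic energy of the momentum at the start and at the end of the trajectory enters the ratio too. Here’s a minimal sketch of that variant, offered as an alternative to compare against rather than a replacement for the code in this post. The wrapper `hmc_linear()`, its arguments and the 500-draw burn-in are all assumptions made for this example.

```r
# Hypothetical wrapper: HMC for the flat-prior linear model with a
# Hamiltonian-based accept/reject step
hmc_linear <- function(x, y, sigma, num_samples = 2000, epsilon = 0.05,
                       num_steps = 50, init = c(0, 0)) {
  # Potential energy: negative log posterior under flat priors
  U <- function(q) -sum(dnorm(y - (q[1] + q[2] * x), 0, sigma, log = TRUE))
  grad_U <- function(q) {
    residuals <- y - (q[1] + q[2] * x)
    c(-sum(residuals / sigma^2), -sum(residuals * x / sigma^2))
  }

  samples <- matrix(NA_real_, nrow = num_samples, ncol = 2,
                    dimnames = list(NULL, c("Intercept", "Slope")))
  q_current <- init
  accepted <- 0

  for (i in seq_len(num_samples)) {
    # Fresh random momentum at the start of every trajectory
    p_current <- rnorm(2)
    q <- q_current
    p <- p_current

    # Leapfrog integration
    for (j in seq_len(num_steps)) {
      p <- p - 0.5 * epsilon * grad_U(q)
      q <- q + epsilon * p
      p <- p - 0.5 * epsilon * grad_U(q)
    }

    # Hamiltonian = potential + kinetic energy, at the start and at the end
    H_current  <- U(q_current) + 0.5 * sum(p_current^2)
    H_proposed <- U(q) + 0.5 * sum(p^2)

    # Accept with probability min(1, exp(H_current - H_proposed))
    if (log(runif(1)) < H_current - H_proposed) {
      q_current <- q
      accepted <- accepted + 1
    }
    samples[i, ] <- q_current
  }

  list(samples = samples, acceptance_rate = accepted / num_samples)
}

set.seed(80)
x <- seq(1, 10, length.out = 20)
y <- rnorm(20, 2 + 3 * x, 5)

fit <- hmc_linear(x, y, sigma = 5)
posterior_means <- colMeans(fit$samples[-(1:500), ]) # Drop the first 500 draws
```

Because leapfrog keeps the Hamiltonian nearly constant along a trajectory, this version tends to accept most proposals when the step size is reasonable. A low acceptance rate is then a handy hint that `epsilon` is too large.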

Visualizing the Final Result

Alright, folks, let’s wrap this up with a peek at how our little algorithm fared in estimating the true intercept and slope of our linear model.

Samples of Intercept and Slope

    colnames(params_samples) <- c("Intercept", "Slope")
    posterior <- as.data.table(params_samples)
    posterior[, sample := seq_len(.N)]
    melt_posterior <- melt(posterior, id.vars = "sample")

    ggplot(melt_posterior, aes(sample, value, col = variable)) +
      facet_grid(rows = vars(variable), scales = "free_y") +
      geom_line(show.legend = FALSE) +
      geom_hline(
        data = data.frame(
          hline = c(true_intercept, true_slope),
          variable = c("Intercept", "Slope")
        ),
        aes(yintercept = hline), linetype = 2, linewidth = 1.5
      ) +
      labs(x = "Samples", y = NULL,
           title = "Convergence of parameter values",
           subtitle = "Traceplot of both the Intercept and Slope") +
      scale_color_brewer(type = "qual", palette = 2) +
      scale_y_continuous(n.breaks = 3) +
      scale_x_continuous(expand = c(0,0)) +
      theme_classic(base_size = 20) +
      theme(legend.position = "top")

In this plot, we’re tracking the evolution of the intercept and slope parameters over the course of our sampling process. Each line traces the successive samples of one parameter drawn from the posterior distribution, showing how it fluctuates from one iteration to the next. The dashed lines mark the true values of the intercept and slope that we used to generate the data. Ideally, we’d like to see the samples converging around these true values, indicating that our sampler is accurately capturing the underlying structure of the data.

This plot gives us a bird’s-eye view of our data, overlaid with both the true regression line (in white) and the estimated regression lines from our HMC sampler (in blue). The true regression line represents the ground truth relationship between our independent and dependent variables, while the estimated regression lines are built from the accepted intercept and slope values drawn from the posterior distribution. By comparing these two, we can assess how well our model has captured the underlying trend in the data.

In this plot, we’re visualizing the joint posterior density of the intercept and slope parameters sampled from our model. The contours represent regions of higher density, with brighter zones indicating areas where more samples are concentrated. The white dashed lines mark the true values of the intercept and slope used to generate the data, providing a reference for comparison. Ideally, we’d like to see the contours align closely with these true values, indicating that our sampler has accurately captured the underlying distribution of the parameters.

Wrapping It Up

So, let’s take a moment to marvel at the marvels of Hamiltonian Monte Carlo (HMC). It’s not every day you see physics rubbing shoulders with statistics, but here we are, with HMC straddling both worlds like a boss.

Bridging Physics and Stats

What makes HMC so darn fascinating is how it borrows tools from physics to tackle complex problems in statistics. It’s like watching two old pals team up to solve a mystery, each bringing their own unique skills to the table. With HMC, we’re not just crunching numbers; we’re tapping into the underlying principles that govern the physical world. I mean, seriously, it gives me goosebumps just thinking about it.

Riding the Simulation Wave

But it’s not just HMC that’s shaking up the stats scene. Nope, the whole world of science is pivoting towards simulation-based statistics and Bayesian methods faster than you can say “p-value.” Why? Because in today’s data-rich landscape, traditional methods just can’t keep up. We need tools like HMC to navigate the choppy waters of high-dimensional data, to tease out the subtle patterns hiding in the noise.

Unraveling the Mystery

Now, here’s the kicker: understanding how HMC works isn’t just some academic exercise. Oh no, it’s the key to unlocking the true power of Bayesian inference. Sure, you can run your models without ever peeking under the hood, but where’s the fun in that? Knowing how HMC works gives you this intuitive grasp of what your model is up to, what might be tripping it up, and how to steer it back on course when things go sideways.

So, here’s to HMC, one of the big algorithms of modern statistics, blurring the lines between physics and stats, and paving the way for a brave new world of simulation-based inference. Cheers to the leapfrogging pioneers of Bayesian stats, charting new territories and uncovering hidden truths, sliding one simulated step at a time.

Citation

BibTeX citation:

@misc{castillo-aguilar2024,
author = {Castillo-Aguilar, Matías},
title = {The {Good,} {The} {Bad,} and {Hamiltonian} {Monte} {Carlo}},
date = {2024-05-15},
url = {https://bayesically-speaking.com//posts/2024-05-15 mcmc part 2},
doi = {10.59350/fa26y-xa178},
langid = {en}
}

Markov Chain Monte What?
Matías Castillo-Aguilar
https://bayesically-speaking.com/posts/2024-04-14 mcmc part 1/

Introduction

Alright, folks, let’s dive into the wild world of statistics and data science! Picture this: you’re knee-deep in data, trying to make sense of the chaos. But here’s the kicker, sometimes the chaos is just too darn complex. With tons of variables flying around, getting a grip on uncertainty can feel like trying to catch smoke with your bare hands.

Keep in mind that the kind of problem we’re dealing with isn’t solely related to the number of dimensions; it’s mostly about trying to estimate something that we can’t see in full beforehand. For instance, consider the following banana distribution (shown below). How could we map this simple two-dimensional surface without computing it all at once?

    library(plotly) # Provides plot_ly(), add_surface() and style()

    dbanana <- function(x) {
      a = 2; b = 0.2;
      y = x / a # Overwritten by the next line
      y = (a * b) * (x^2 + a^2)
    }

    x <- seq(-6, 6, length.out = 300)
    y = dbanana(x)
    z <- MASS::kde2d(x, y, n = 100, lims = c(-10, 10, -2.6, 20))

    plot_ly(x = z$x, y = z$y, z = sqrt(z$z)) |>
      add_surface() |>
      style(hoverinfo = "none")

Just put a darn grid to it

You know when you hit a roadblock in your calculations, and you’re like, “Can’t we just crunch the numbers for every single value?” Well, let’s break it down. Picture a grid with $N$ points for each of $D$ dimensions. Now, brace yourself, ’cause the number of computations needed is like $N$ raised to the power of $D$ ($N^D$).

So, let’s say you wanna estimate 100 points (to get a decent estimation of the shape) for each of 100 dimensions. That’s like slamming your head against ten to the power of 200 computations… that’s a hell of a lot of computations!

Sure, in la-la land, you could approximate every single number to some degree of precision. But let’s get real here, even if you had all the time in the world, you’d still be chipping away at those calculations until the sun swallowed the Earth, especially with continuous cases and tons of dimensions that are somewhat correlated (which in reality, tends to be the case).

This headache we’re dealing with? It’s what we “affectionately” call — emphasis on double quotes — the curse of dimensionality. It’s like trying to squeeze a square peg into a round hole… it ain’t gonna happen without a supersized hammer!

Code

library(ggplot2)

# Grid size for 100 points per dimension, at 1, 4, 9, ..., 100 dimensions
curse_dimensionality <- data.frame(
  dimensions = factor((1:10)^2),
  calculations = 100^((1:10)^2)
)

ggplot(curse_dimensionality, aes(dimensions, calculations)) +
  geom_col(fill = ggsci::pal_jama()(1)) +
  scale_y_continuous(
    transform = "log10", n.breaks = 9,
    labels = scales::label_log(), expand = c(0, 0, .1, 0)
  ) +
  labs(
    y = "Computations (log-scale)", x = "Dimensions (Variables)",
    title = "Computations needed to compute a grid of 100 points",
    subtitle = "As a function of dimensions/variables involved"
  ) +
  theme_classic(base_size = 20)

Explaining the curse of dimensionality further

Imagine you’re trying to create a grid to map out the probability space for a set of variables. As the number of dimensions increases, the number of grid points needed to adequately represent the space explodes exponentially. This means that even with the most powerful computers, it becomes practically impossible to compute all the probabilities accurately.

Sampling the unknown: Markov Chain Monte Carlo

Now, if we can’t crack the problem analytically (which, let’s face it, is the case most of the time), we gotta get creative. Lucky for us, there’s a bunch of algorithms that can lend a hand by sampling this high-dimensional parameter space. Enter the Markov Chain Monte Carlo (MCMC) family of algorithms.

But hold up — Markov Chain Monte What? Yeah, it’s a mouthful, but bear with me. You’re probably wondering how this fancy-schmancy term is connected to exploring high-dimensional probability spaces. Well, I’ll let you in on the secret sauce behind these concepts and why they’re the go-to tools in top-notch probabilistic software like Stan.

But before we get into the nitty-gritty of MCMC, let’s take a detour and talk about Markov Chains, because they’re like the OGs of this whole MCMC gang.

Understanding Markov Chains: A rainy example

Consider the following scenario: if today is rainy, the probability that tomorrow will be rainy again is 60%, but if today is sunny, the probability that tomorrow will be rainy is only 30%. Conversely, the probability of tomorrow being sunny is 40% if today is rainy, and 70% if today is sunny.

As you can see, the probability of a future step depends on the current step. This logic is central to Bayesian inference, as it allows us to talk about the conditional probability of a future value based on a previous one, like sampling across a continuous variable.

Converging to an answer

Now, fast forward a year. If we keep an eye on the weather every day, we’ll notice something interesting: the relative frequencies of rainy and sunny days start to settle into a rhythm, each converging to a single number. This steady state is what we call a stationary distribution. It’s like the true probability of what the weather’s gonna be like in the long run, taking into account all the different scenarios.

Code

simulate_weather <- function(total_time) {
  weather <- vector("character", total_time) # Create slots for each day
  day <- 1 # First day
  weather[day] <- sample(c("Rainy", "Sunny"), size = 1) # Weather for first day
  while (day < total_time) {
    day <- day + 1 # Add one more day
    # The new day's weather depends on the previous day's weather
    if (weather[day - 1] == "Rainy") {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.6, .4))
    } else {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.3, .7))
    }
  }
  return(weather)
}

sim_time <- 365 * 1
weather <- simulate_weather(total_time = sim_time)

# Cumulative proportion of each kind of day, in long format for plotting
weather_data <- data.frame(
  prop = c(cumsum(weather == "Rainy") / seq_len(sim_time),
           cumsum(weather == "Sunny") / seq_len(sim_time)),
  time = c(seq_len(sim_time), seq_len(sim_time)),
  weather = c(rep("Rainy", times = sim_time), rep("Sunny", times = sim_time))
)

ggplot(weather_data, aes(time, prop, fill = weather)) +
  geom_area() +
  scale_y_continuous(labels = scales::label_percent(), n.breaks = 6,
                     name = "Proportion of each weather", expand = c(0, 0)) +
  scale_x_continuous(name = "Days", n.breaks = 10, expand = c(0, 0)) +
  scale_fill_brewer(type = "qual", palette = 3) +
  labs(fill = "Weather", title = "Convergence to stationary distribution",
       subtitle = "Based on cumulative proportion of each Sunny or Rainy days") +
  theme_classic(base_size = 20)

This heuristic allows us to naturally converge to an answer without needing to solve it analytically, which tends to be useful for really complex and high-dimensional problems.

Sure, we could’ve crunched the numbers ourselves to figure out these probabilities. But why bother with all that math when we can let time do its thing and naturally converge to the same answer? Especially when we’re dealing with complex problems that could have given even Einstein himself a headache.
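For the curious, crunching the numbers for this tiny example takes only a few lines. The sketch below is just an illustration (the names `P` and `stationary` are mine, not part of the original analysis): it writes the weather rules as a transition matrix and finds the long-run distribution as the left eigenvector of that matrix with eigenvalue 1.

```r
# Transition matrix: rows are today's weather, columns are tomorrow's
states <- c("Rainy", "Sunny")
P <- matrix(
  c(0.6, 0.4,   # Today rainy: 60% rainy, 40% sunny tomorrow
    0.3, 0.7),  # Today sunny: 30% rainy, 70% sunny tomorrow
  nrow = 2, byrow = TRUE,
  dimnames = list(today = states, tomorrow = states)
)

# The stationary distribution is the left eigenvector of P with eigenvalue 1,
# i.e. an eigenvector of t(P), rescaled so that its entries sum to one
eig <- eigen(t(P))
lead <- which.max(Re(eig$values))
stationary <- Re(eig$vectors[, lead])
stationary <- stationary / sum(stationary)
names(stationary) <- states

round(stationary, 3)
```

The result is 3/7 rainy and 4/7 sunny (about 43% and 57%), which is the level the simulated proportions should drift toward. For a two-state chain this is easy; the point of simulation is that it keeps working when the state space is far too large to solve this way.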

Explaining the convergence process further

The idea of convergence to a stationary distribution can be likened to taking a random walk through the space of possible outcomes. Over time, the relative frequencies of each outcome stabilize, giving us a reliable estimate of the true probabilities.

As we’ve seen, sometimes it becomes impractical to solve analytically or even approximate the posterior distribution using a grid, given the number of calculations needed to even get a decent approximation of the posterior.

However, we’ve also seen that Markov Chains might offer us a way to compute complex conditional probabilities and, if we let them run long enough, they will eventually converge to the stationary distribution, which could resemble the posterior distribution itself. So, all things considered, when does the Monte Carlo part come in?

The Need for Monte Carlo Methods

Alright, let’s break down the magic of Monte Carlo methods in plain English. Picture this: in the wacky world of random events, being able to sample from a distribution is like having a crystal ball to predict the future — pretty nifty, right?

Now, imagine we’re sampling from a normal probability density, say, body weight in kilograms with a mean of 80 and a standard deviation of 5. We grab a random sample of 10 folks, calculate their average weight, and repeat this process a thousand times.

In the following figure, we overlay a histogram of the sample means, one from each simulated sample, on the population distribution we are sampling from. This opens up an interesting opportunity: using this Monte Carlo simulation, we can get an intuition of how likely it is for a sample of 10 individuals to have a mean outside the range of 75 to 85. It’s not impossible, but it’s unlikely.

Code

# 1000 simulated samples of 10 individuals each (one sample per column)
mean_weights <- matrix(data = rnorm(10 * 1000, 80, 5), nrow = 10, ncol = 1000) |>
  colMeans()

cols <- ggsci::pal_jama()(2)

ggplot() +
  stat_function(fun = ~ dnorm(.x, 80, 5), xlim = c(60, 100), geom = "area", fill = cols[2]) +
  geom_histogram(aes(x = mean_weights, y = after_stat(density) / 3.75),
                 fill = cols[1], col = cols[2], binwidth = .4) +
  scale_y_continuous(expand = c(0, 0), name = "Density", labels = NULL, breaks = NULL) +
  scale_x_continuous(expand = c(0, 0), name = "Weight (kg)", n.breaks = 10) +
  geom_curve(
    data = data.frame(x = c(70), xend = c(74.5), y = c(0.061), yend = c(0.05)),
    aes(x = x, xend = xend, y = y, yend = yend),
    curvature = -.2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")
  ) +
  geom_curve(
    data = data.frame(x = c(90), xend = c(82), y = c(0.061), yend = c(0.05)),
    aes(x = x, xend = xend, y = y, yend = yend),
    curvature = .2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")
  ) +
  geom_text(aes(x = c(67, 94), y = c(0.0605),
                label = c("Population\ndistribution", "Means of each\nsimulated sample")),
            size = 6) +
  theme_classic(base_size = 20)

With each sample, we’re capturing the randomness of the population’s weight distribution. And hey, it’s not just about weight; we can simulate all sorts of wild scenarios, from multi-variable mayhem to linear model lunacy. This is the heart and soul of Monte Carlo methods: taking random shots in the dark to mimic complex processes.

But here’s the kicker: the more samples we take, the clearer the picture becomes. For instance, if we take ten times the amount of samples used, we would get a better intuition about the uncertainty around the expectation for each sample of 10 individuals, which could have important applications in the design of experiments and hypothesis testing.
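To make the “more samples, clearer picture” idea concrete, here is a minimal sketch, assuming the same Normal(80, 5) population and samples of 10 people. The helper `mc_tail_prob()` and the seed are hypothetical choices of mine. It estimates the chance that a sample mean lands outside 75 to 85 with increasing numbers of simulations, and compares each estimate with the exact value from normal theory.

```r
set.seed(42) # Arbitrary seed, only for reproducibility

# Monte Carlo estimate of P(sample mean < 75 or > 85) for samples of
# `n` individuals drawn from Normal(mean = mu, sd = sigma)
mc_tail_prob <- function(n_sims, n = 10, mu = 80, sigma = 5) {
  # One simulated sample per column, then one mean per sample
  means <- colMeans(matrix(rnorm(n * n_sims, mu, sigma), nrow = n))
  mean(means < 75 | means > 85)
}

# Exact value: the sample mean is Normal(mu, sigma / sqrt(n)), and the
# two tails are symmetric around 80
exact <- 2 * pnorm(75, mean = 80, sd = 5 / sqrt(10))

n_sims <- c(1e3, 1e4, 1e5)
estimates <- sapply(n_sims, mc_tail_prob)
rbind(n_sims = n_sims, estimate = estimates, exact = exact)
```

The exact probability is well under 1%, so a thousand simulations may catch only a handful of such samples (or none), while a hundred thousand should pin it down much more tightly.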

And that’s where Monte Carlo methods shine. By generating a boatload of samples, we can unravel the mysteries of even the trickiest distributions, no crystal ball required. It’s a game changer for exploring the unknown without needing a PhD in rocket science.

Explaining the importance of Monte Carlo further

Monte Carlo methods provide a powerful tool for approximating complex distributions by sampling from them. By generating a large number of samples, we can gain insight into the shape and properties of the distribution without needing to explicitly calculate all possible outcomes.

Basics of MCMC

Alright, let’s break down the basics of MCMC. Picture this: you’ve got these two heavyweights in the world of statistics, Markov Chains and Monte Carlo methods.

On one side, you’ve got Markov Chains. These bad boys help us predict the probability of something happening based on what happened before. It’s like saying, “Hey, if it rained yesterday, what’s the chance it’ll rain again today?”

Then, there are Monte Carlo methods. These puppies work by randomly sampling from a distribution to get an idea of what the whole shebang looks like. It’s like throwing a bunch of darts at a dartboard in the dark and hoping you hit the bullseye.

However, the question remains: how do they team up to tackle real-world problems?

What is MCMC actually doing?

In essence, MCMC is an algorithm that generates random samples from a proposal distribution. These samples are accepted or rejected based on how much more likely the proposed sample is compared to the previous accepted sample.

In this way, the proposed samples are accepted in the same proportion as the actual probability in the target distribution, accepting more samples that are more likely and fewer samples that are less likely.

The fascinating nature of this heuristic is that it works to approximate complex distributions without needing to know much about the shape of the final distribution.

So, think of it as trekking through this complex landscape, taking random steps (the Monte Carlo part) but guided by the likelihood of each move, given where you currently stand (the Markov Chain part). It’s a meticulous journey, but one that ultimately leads us to a better understanding of these elusive distributions.
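In symbols, and assuming a symmetric proposal (a Normal step around the current value is one example), the accept/reject rule described above can be sketched as follows. The notation is my own: $p$ is the target density, $x_t$ the current value and $x'$ the proposed one.

```latex
\alpha = \min\left(1, \frac{p(x')}{p(x_t)}\right),
\qquad
x_{t+1} =
\begin{cases}
x'  & \text{with probability } \alpha \\
x_t & \text{otherwise}
\end{cases}
```

Because only the ratio $p(x') / p(x_t)$ matters, any normalizing constant in $p$ cancels out, which is why we never need to integrate the whole function.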

For instance, consider that we have a distribution (shown below) that we can’t compute in full, because it would take too long to integrate the whole function. This will be our target distribution, from which we can only compute the density of one value at a time.

About the target distribution

In practice, we would derive the target distribution from the data and prior information. This enables us to estimate the density in a point-wise manner, without the need to estimate the whole PDF all at once. But for the sake of demonstration, we will use the Gamma probability density function.

However, keep in mind that a single distribution family can’t perfectly describe every probability density; sometimes the density is a mixture of distributions, or involves truncation or censoring. It all comes down to the underlying process that generates the data we are trying to mimic.

Code

# Target distribution that we in practice would derive from
# the data.
target_dist <- function(i) dgamma(i, shape = 2, scale = 1)

ggplot() +
  stat_function(fun = target_dist, xlim = c(0, 11), geom = "area", fill = "#374E55FF") +
  scale_y_continuous(breaks = NULL, name = "Density", expand = c(0, 0)) +
  scale_x_continuous(name = "Some scale", expand = c(0, 0)) +
  theme_classic(base_size = 20)

The next thing to do is to specify a proposal distribution, from which we’ll generate proposals for the next step. To this end we’ll be using a Normal density function with $\mu = 0$ and $\sigma = 1$.

# This is a function that will generate proposals for the next step.
proprosal <- function() rnorm(1, mean = 0, sd = 1)

And set some algorithm parameters that are necessary for our MCMC to run:

## Algorithm parameters ----
total_steps <- 1000 # Total number of steps
step <- 1 # We start at step 1
value <- 10 # Set an initial starting value

Finally, we run our algorithm as explained in previous sections. Try to follow the code to get an intuition of what it is doing.

## Algorithm ----
set.seed(1234) # Seed for reproducibility

while (step < total_steps) {
  # Increase for next step
  step <- step + 1

  ## 1. Propose a new value ----
  # Proposal of the next step is ...
  value[step] <-
    # the previous step plus...
    value[step - 1L] +
    # a change in a random direction (based on the
    # proposal distribution)
    proprosal()

  ## 2. We see if the new value is more or less likely ----
  # How likely (in the target distribution)
  likelihood <-
    # is the proposed value compared to the previous step
    target_dist(value[step]) / target_dist(value[step - 1L])

  ## 3. Based on its likelihood, we accept or reject it ----
  # If the proposed value is less likely, we accept it only with
  # probability equal to that density ratio; otherwise we reject it
  # and keep the previous value
  if (likelihood < runif(1)) value[step] <- value[step - 1L]

  # Then we repeat for the next step
}

Finally, let’s explore how well our algorithm converges to the target distribution.

Code

mcmc <- data.frame(step = seq_len(step), value = value)

ggplot(mcmc, aes(x = step, y = value)) +
  geom_line(col = "#374E55FF") +
  ggside::geom_ysidehistogram(aes(x = -after_stat(count)), fill = "#374E55FF", binwidth = .3) +
  ggside::geom_ysidedensity(aes(x = -after_stat(count) * .35), col = "#374E55FF") +
  ggside::scale_ysidex_continuous(expand = c(0, 0, 0, .1), breaks = NULL) +
  scale_x_continuous(expand = c(0, 0), name = "Step") +
  scale_y_continuous(name = NULL, position = "right") +
  labs(title = "Trace of MCMC values to target distribution",
       subtitle = "Evolution of values at each step") +
  theme_classic(base_size = 20) +
  ggside::ggside(y.pos = "left") +
  theme(ggside.panel.scale = .4)

Another thing we care about is how well our MCMC is performing. After all, if it performs poorly, what would be the point of using it in the first place? To check this, we’ll compare the expectation ($E[X]$) of the target distribution against the one estimated from our MCMC samples.

For this, we have to consider that the expectation, $E[X]$, of any Gamma distribution is equal to the shape parameter ($k$) times the scale parameter ($\theta$). For our target distribution, we can express this as follows:

$$E[X] = k\theta = 2 \times 1 = 2$$

Code

ggplot(mcmc, aes(x = step, y = cumsum(value) / step)) +
  geom_line(col = "#374E55FF") +
  scale_x_continuous(expand = c(0, .1), name = "Steps (log-scale)",
                     transform = "log10", labels = scales::label_log()) +
  scale_y_continuous(name = NULL, expand = c(0, 1)) +
  labs(title = "Convergence to location parameter",
       subtitle = "Cumulative mean across steps") +
  geom_hline(aes(yintercept = 2), col = "darkred") +
  geom_hline(aes(yintercept = mean(value)), lty = 2) +
  annotate(x = 1.5, xend = 1.1, y = 7.5, yend = 9.5, geom = "curve", curvature = -.2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = 2, y = 6.8, label = "Initial value", size = 5, geom = "text") +
  annotate(x = (10^2.5), xend = (10^2.6), y = 5, yend = 2.5, geom = "curve", curvature = .2,
           arrow = arrow(length = unit(.1, "in"), type = "closed")) +
  annotate(x = (10^2.5), y = 5.8, label = "Convergence", size = 5, geom = "text") +
  theme_classic(base_size = 20)

Popular MCMC Algorithms

This general process is central to MCMC, but more specifically to the Metropolis-Hastings algorithm. However, and in order to broaden our understanding, let’s explore additional MCMC algorithms beyond the basic Metropolis-Hastings with some simple examples.

Gibbs Sampling: A Buffet Adventure

Imagine you’re at a buffet with stations offering various cuisines — Italian, Chinese, Mexican — you name it. You’re on a mission to create a plate with a bit of everything, but here’s the catch: you can only visit one station at a time. Here’s how you tackle it:

Hit up a station and randomly pick a dish.

Move on to the next station and repeat the process.

Keep going until you’ve got a plateful of diverse flavors.

Gibbs sampling works kind of like this buffet adventure. You take turns sampling from conditional distributions, just like you visit each station for a dish. Each time, you focus on one variable, updating its value while keeping the others constant. It’s like building your plate by sampling from each cuisine until you’ve got the perfect mix.
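To see the buffet in action, here is a minimal sketch of a Gibbs sampler. I am assuming a toy target that is not in the original post: a standard bivariate Normal with correlation `rho`, chosen because both conditional distributions are known Normals. The function name `gibbs_bvn()` and its arguments are hypothetical.

```r
set.seed(2024) # Arbitrary seed, only for reproducibility

# Gibbs sampler for a standard bivariate Normal with correlation `rho`.
# Each conditional is itself Normal:
#   x | y ~ Normal(rho * y, sqrt(1 - rho^2)), and likewise for y | x
gibbs_bvn <- function(n_iter, rho, start = c(0, 0)) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = 2,
                  dimnames = list(NULL, c("x", "y")))
  x <- start[1]
  y <- start[2]
  cond_sd <- sqrt(1 - rho^2)
  for (i in seq_len(n_iter)) {
    x <- rnorm(1, mean = rho * y, sd = cond_sd) # Update x, holding y fixed
    y <- rnorm(1, mean = rho * x, sd = cond_sd) # Update y, holding x fixed
    draws[i, ] <- c(x, y)
  }
  draws
}

draws <- gibbs_bvn(n_iter = 5000, rho = 0.5)
colMeans(draws)                 # Both should be close to 0
cor(draws[, "x"], draws[, "y"]) # Should be close to rho
```

Notice there is no accept/reject step: every draw from a conditional is kept. Try pushing `rho` toward 0.99 and the chain will take much smaller effective steps, since each variable is pinned down by the other.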

Hamiltonian Monte Carlo: Charting Your Hiking Path

Picture yourself hiking up a rugged mountain with rocky trails and valleys. Your goal? Reach the summit without breaking a sweat — or falling off a cliff. So, you whip out your map and binoculars to plan your route:

Study the map to plot a path with minimal uphill battles and maximum flat stretches.

Use the binoculars to scout ahead and avoid obstacles along the way.

Adjust your route as you go, smoothly navigating the terrain like a seasoned pro.

Hamiltonian Monte Carlo (HMC) is a bit like this hiking adventure. It simulates a particle moving through a high-dimensional space, using gradient info to find the smoothest path. Instead of blindly wandering, HMC leverages the slope (gradient) of the target distribution to explore efficiently. It’s like hiking with a GPS that guides you around the rough spots and straight to the summit.
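As a rough sketch of what “simulating a particle” means, here is a bare-bones HMC sampler for a one-dimensional standard Normal target. Everything in it is an assumption made for illustration (the target, the names `U`, `grad_U` and `hmc_step()`, and the step size and number of leapfrog steps); it is not how Stan or any particular library implements it.

```r
set.seed(2024) # Arbitrary seed, only for reproducibility

# Target: standard Normal. HMC works with the negative log-density
# (the "potential energy") U and its gradient.
U      <- function(q) 0.5 * q^2
grad_U <- function(q) q

# One HMC transition: draw a momentum, simulate the trajectory with the
# leapfrog integrator, then accept or reject based on the change in energy
hmc_step <- function(q, step_size = 0.1, n_leapfrog = 15) {
  p <- rnorm(1) # Random momentum "kick"
  q_new <- q
  p_new <- p - 0.5 * step_size * grad_U(q_new) # Half step for momentum
  for (l in seq_len(n_leapfrog)) {
    q_new <- q_new + step_size * p_new # Full step for position
    # Full momentum step, except after the last position update
    if (l < n_leapfrog) p_new <- p_new - step_size * grad_U(q_new)
  }
  p_new <- p_new - 0.5 * step_size * grad_U(q_new) # Final half step
  # Total energy (potential + kinetic) before and after the trajectory
  h_old <- U(q) + 0.5 * p^2
  h_new <- U(q_new) + 0.5 * p_new^2
  if (runif(1) < exp(h_old - h_new)) q_new else q
}

n_iter <- 2000
samples <- numeric(n_iter)
q <- 3 # Deliberately poor starting value
for (i in seq_len(n_iter)) {
  q <- hmc_step(q)
  samples[i] <- q
}
c(mean = mean(samples), sd = sd(samples)) # Should be near 0 and 1
```

The two tuning knobs mentioned for HMC show up directly as `step_size` and `n_leapfrog`: too large a step and the energy error grows (more rejections), too short a trajectory and the sampler falls back to random-walk behavior.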

Strengths, Weaknesses, and Real-World Applications

Now that you’ve dipped your toes into the MCMC pool, it’s time to talk turkey — well, sampling. Each MCMC method has its perks and quirks, and knowing them is half the battle.

Gibbs sampling is the laid-back surfer dude of the group — simple, chill, and great for models with structured dependencies. But throw in some highly correlated variables, and it starts to wobble like a rookie on a surfboard.

Meanwhile, HMC is the sleek Ferrari — efficient, powerful, and perfect for tackling complex models head-on. Just don’t forget to fine-tune those parameters, or you might end up spinning out on a sharp curve.

Key Differences

Sampling Approach

Metropolis-Hastings: Takes random walks to generate samples, with acceptance based on a ratio of target distribution probabilities.

Gibbs Sampling: Updates variables one by one based on conditional distributions, like a tag team wrestling match.

Hamiltonian Monte Carlo: Glides through high-dimensional space using deterministic trajectories guided by Hamiltonian dynamics, like a graceful dancer in a crowded room.

Efficiency and Exploration

Metropolis-Hastings: Easy to implement but might struggle to explore efficiently, especially in high-dimensional spaces.

Gibbs Sampling: Perfect for structured models but may stumble with highly correlated variables.

Hamiltonian Monte Carlo: Efficiently navigates high-dimensional spaces, leading to faster convergence and smoother mixing.

Acceptance Criterion

Metropolis-Hastings: Decides whether to accept or reject proposals based on a ratio of target distribution probabilities.

Gibbs Sampling: Skips the acceptance drama and generates samples directly from conditional distributions.

Hamiltonian Monte Carlo: Judges proposals based on the joint energy of position and momentum variables, like a strict dance instructor.

Parameter Tuning and Complexity

Metropolis-Hastings: Requires tweaking the proposal distribution but keeps it simple.

Gibbs Sampling: A breeze to implement, but watch out for those conditional distributions — they can be sneaky.

Hamiltonian Monte Carlo: Needs tuning of parameters like step size and trajectory length, and the implementation might get a bit hairy with momentum variables and gradient computation.

MCMC in action

In the following, you can see an interactive animation of different MCMC algorithms (MH, Gibbs and HMC) and how they work to uncover distributions in two dimensions. The code for this animation is borrowed from Chi Feng’s github. You can find the original repository with corresponding code here: https://github.com/chi-feng/mcmc-demo

Practical Tips for the Real World

Implementing MCMC Algorithms in Practice

Alright, theory’s cool and all, but let’s get down to brass tacks. When you’re rolling up your sleeves to implement MCMC algorithms, it’s like picking the right tool for the job. Simple models? Metropolis-Hastings or Gibbs sampling has your back. But when you’re wrangling with the big boys — those complex models — that’s when you call in Hamiltonian Monte Carlo. It’s like upgrading from a rusty old wrench to a shiny new power tool. And don’t forget about tuning those parameters — it’s like fine-tuning your car for a smooth ride.

Beyond all the technical jargon, successful Bayesian inference is part gut feeling, part detective work. Picking the right priors is like seasoning a dish — you want just the right flavor without overpowering everything else. And tuning those parameters? It’s like fine-tuning your favorite instrument to make sure the music hits all the right notes.

Challenges and What Lies Ahead

But hey, nothing worth doing is ever a walk in the park, right? MCMC might be the hero of the Bayesian world, but it’s not without its challenges. Scaling up to big data? It’s like trying to squeeze into those skinny jeans from high school — uncomfortable and a bit awkward. And exploring those complex parameter spaces? It’s like navigating a maze blindfolded.

But fear not! There’s always a light at the end of the tunnel. Recent innovations in Bayesian inference, like variational inference (we’ll tackle this cousin of MCMC in the next posts) and probabilistic programming languages (like Stan), are like shiny beacons, guiding us to new horizons.

These days, some probabilistic programming languages, like Stan, use a souped-up version of the Hamiltonian Monte Carlo algorithm with hyperparameters tuned on the fly. These tools are like magic wands that turn your ideas into reality, just specify your model, and let the parameter space exploration happen in the background, no sweat.

Wrapping It Up

As we wrap up our journey into the world of MCMC, let’s take a moment to appreciate the wild ride we’ve been on. MCMC might not wear a cape, but it’s the unsung hero behind so much of what we do in Bayesian data analysis. It’s the tool that lets us dive headfirst into the murky waters of uncertainty and come out the other side with clarity and insight.

In the next installment of our MCMC series, we’ll dive into the infamous Hamiltonian Monte Carlo and unravel the statistical wizardry behind this and other similar algorithms. Until then, happy sampling!

Citation

BibTeX citation:

@misc{castillo-aguilar2024,
author = {Castillo-Aguilar, Matías},
title = {Markov {Chain} {Monte} {What?}},
date = {2024-04-25},
url = {https://bayesically-speaking.com//posts/2024-04-14 mcmc part 1},
doi = {10.59350/mxfyk-6av39},
langid = {en}
}

Categories: intro, mcmc
Published: Thu, 25 Apr 2024 03:00:00 GMT

Welcome to Bayesically Speaking
Matías Castillo-Aguilar
https://bayesically-speaking.com/posts/2023-05-30 welcome/

Hello stranger

First of all, welcome to the first post of “Bayesically Speaking” (which, in case you haven’t noticed, is a wordplay between “Basically Speaking” and the (hopefully) well-known Bayes’ theorem). Although the website is offline at the time of writing this article, I find myself following the advice of all those people who encouraged me to trust my instinct and dare to do what I have always wanted: to transmit the thrill of using science as a tool to know and understand the reality that surrounds us and that we perceive in a limited way through our senses.

For years, my interests have revolved around understanding the world through the lens of statistics, particularly as a tool to better understand and quantify the relationships between the moving parts that make up many health outcomes. Another aspect that I find fascinating is how certain variables can go unnoticed when viewed separately, but when viewed together can have radically different behaviors.

Code

library(bayestestR) # Provides simulate_simpson()
library(data.table)
library(ggplot2)

# Simulated data showing Simpson's paradox across four groups
sim_data <- simulate_simpson(n = 100, difference = 2, groups = 4, r = .7) |>
  as.data.table()

sim_data[, Group := factor(Group,
                           levels = c("G_1", "G_2", "G_3", "G_4"),
                           labels = c("Placebo", "Low dose", "Medium dose", "High dose"))]

ggplot(sim_data, aes(V1, V2, col = Group)) +
  geom_point() +
  geom_smooth(method = "lm") +                            # Trend within each group
  geom_smooth(method = "lm", aes(group = 1, col = NULL)) + # Trend ignoring groups
  scale_color_brewer(type = "qual", palette = 2) +
  labs(x = "Time exposure", y = expression(Delta * "TNF-" * alpha)) +
  theme_classic() +
  theme(legend.position = "top")

The statistics toolbox

Within the statistics toolbox, we have commonly used tests like t-tests, ANOVA, correlations, and regression. These methods have their advantages, as they are relatively easy to use and understand. However, like any other tool, they also have their limitations. For instance, they struggle with scenarios involving variables with asymmetric distributions, non-linear relationships, unbalanced groups, heterogeneous variance, extreme values, or repeated measurements with loss of follow-up.

To address these limitations, non-parametric alternatives have been developed. These approaches offer flexibility but make it challenging to extrapolate inferences to new data due to the lack of distributional parameters. Other models, such as neural networks or random forest models, provide assistance when analyzing data with special properties. However, they often sacrifice simplicity and interpretability for increased flexibility and are commonly referred to as “black box” models.

Despite the availability of these alternative methods, there is still a pressing need to incorporate previous knowledge and align with the way human understanding is constructed. As humans, our perception of the world is shaped by experiences and prior beliefs. This is where Bayesian statistics come into play.

Bayesian statistics offer several advantages over classical statistics (also known as “frequentist”). Firstly, they provide a coherent framework for incorporating prior information into our analysis, enabling us to update our beliefs systematically. Additionally, Bayesian statistics allow us to quantify uncertainty through probability distributions, offering a more intuitive and interpretable way to express our findings and the degree of certainty.

The toss of a coin

Let’s consider the following example: Imagine we have a belief that when tossing a coin, there is a higher probability of it landing on heads. Our prior knowledge stems from a previous experiment where, out of 15 coin tosses, 10 resulted in heads. This implies a calculated probability of $10/15 \approx 0.67$ based on the previous data.

With this information, we decide to explore further and conduct our own experiment. To our astonishment, out of 15 tosses, we observe an unexpected outcome: 13 tails and only 2 heads! This result suggests that the probability of getting heads based solely on our new data is a mere $2/15 \approx 0.13$. However, it would be unwise to dismiss the prior evidence in light of these conflicting results. Incorporating these findings into our body of knowledge becomes even more crucial as we strive to gain a deeper understanding of the combined effect.

Computing the posterior

To estimate the posterior probability of getting heads after tossing a coin, we can use the Bayesian framework. Let’s denote the probability of getting heads in a coin toss as $H$.

According to the information provided, we have the prior probability of $H$ estimated from an independent experiment as 10 heads out of 15 tosses. This can be written as a Beta distribution:

$$H \sim \text{Beta}(10, 5)$$

This symbol “$\sim$” means distributed as

Here, the Beta distribution parameters are (10, 5) since we had 10 heads and 5 tails in the prior experiment.

Now, a new experiment with the same 15 tosses gives us 2 heads. To update our prior belief, we can use this information to calculate the posterior probability, which can be expressed as follows:

$$P(H \mid \text{Data}) \propto P(\text{Data} \mid H) \times P(H)$$

This symbol “$\propto$” means proportional to

Which is equivalent to saying:

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$

To calculate the posterior probability, we need to normalize the product of the likelihood and prior, which involves integrating over all possible values of $H$. However, in this case, we can use a shortcut because the prior distribution is conjugate to the binomial distribution, so the posterior distribution will also follow a Beta distribution:

$$H \mid \text{Data} \sim \text{Beta}(10 + 2, 5 + 13) = \text{Beta}(12, 18)$$

About normalization

The product of both the prior and the likelihood maintains the same shape as the final posterior probability distribution, indicated by the “proportional to” ($\propto$) in the previous equation. However, this raw product does not sum up to 1, making it an improper probability density function. To rectify this, the raw product needs to be normalized using integration or simulation in most cases.

After incorporating the data from the new experiment, the parameters of the Beta distribution become (12, 18) since we had 2 heads and 13 tails in the new experiment, meaning 12 heads and 18 tails in total.

About conjugacy

When we choose a Beta distribution as our prior belief and gather new data from a coin toss, an intriguing property emerges: the posterior distribution also follows a Beta distribution. This property, known as conjugacy, offers a valuable advantage by simplifying calculations. It acts as a mathematical shortcut that saves time and effort, making the analysis more efficient and streamlined.

To calculate the posterior probability of getting heads, we can consider the mode (maximum) of the Beta distribution, which is $\frac{\alpha - 1}{\alpha + \beta - 2}$:

$$\frac{12 - 1}{12 + 18 - 2} = \frac{11}{28} \approx 0.39$$

Therefore, the posterior probability of getting heads is approximately 39% when we consider all the available evidence.
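The conjugate shortcut is simple enough to check by hand, or with a few lines of R. This is a minimal sketch using only the counts stated above; the variable names are mine, and the credible interval at the end is an extra I added for illustration.

```r
# Prior: Beta(10, 5), from 10 heads and 5 tails in the earlier experiment
prior_heads <- 10
prior_tails <- 5

# New data: 2 heads and 13 tails out of 15 tosses
new_heads <- 2
new_tails <- 13

# Conjugate update: add the observed counts to the prior parameters
post_a <- prior_heads + new_heads # 12
post_b <- prior_tails + new_tails # 18

# Mode of a Beta(a, b) distribution (valid when a > 1 and b > 1)
post_mode <- (post_a - 1) / (post_a + post_b - 2)
post_mode # 11 / 28, about 0.393

# 95% credible interval from the Beta quantile function
qbeta(c(0.025, 0.975), post_a, post_b)
```

No integration was needed: with a conjugate prior, updating is just bookkeeping on the two Beta parameters.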

Code

# Prior and Likelihood functions
# Note: the data curve is drawn as Beta(2, 13), so the numerically
# normalized product below peaks near 0.385, slightly to the left of
# the 11/28 mode of the conjugate Beta(12, 18) result
data = function(x, to_log = FALSE) dbeta(x, 2, 13, log = to_log)
prior = function(x, to_log = FALSE) dbeta(x, 10, 5, log = to_log)

# Posterior
posterior = function(x) {
  p_fun = function(i) {
    # Operation is on log-scale merely for computing performance
    # and to minimize rounding errors given the small nature of
    # probability density values at each interval.
    i_log = data(i, to_log = TRUE) + prior(i, to_log = TRUE)
    # Then transformed back to get probabilities again
    exp(i_log)
  }
  # Then we integrate using base function `integrate`
  const = integrate(f = p_fun, lower = 0L, upper = 1L,
                    subdivisions = 1e3L, rel.tol = .Machine$double.eps)$value
  p_fun(x) / const
}

## Plotting phase

### Color palette
col_pal <- c(Prior = "#DEEBF7", Data = "#3182BD", Posterior = "#9ECAE1")

### Main plotting code
ggplot() +
  #### Main probability density functions
  stat_function(aes(fill = "Data"), fun = data, geom = "density", alpha = 1/2) +
  stat_function(aes(fill = "Prior"), fun = prior, geom = "density", alpha = 1/2) +
  stat_function(aes(fill = "Posterior"), fun = posterior, geom = "density", alpha = 1/2) +
  #### Minor aesthetics tweaks
  labs(fill = "", y = "Density", x = "Probability of getting heads") +
  scale_fill_manual(values = col_pal, aesthetics = "fill") +
  scale_x_continuous(labels = scales::label_percent(), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 6.5)) +
  see::theme_modern() +
  theme(legend.position = "top", legend.spacing.x = unit(3, "mm")) +
  #### Arrows
  geom_curve(aes(x = .81, y = 4.1, xend = .69232, yend = 3.425), curvature = .4,
             arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
  geom_text(aes(x = .9, y = 4.1, label = "Beta(10,5)")) +
  geom_curve(aes(x = .2, y = 5.9, xend = .07693, yend = 5.45), curvature = .4,
             arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
  geom_text(aes(x = .29, y = 5.85, label = "Beta(2,13)")) +
  geom_curve(aes(x = .5, y = 5, xend = .3847, yend = 4.4), curvature = .4,
             arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
  geom_text(aes(x = .55, y = 5, label = "≈ 39%"))

Practical implications

This example truly showcases the power of Bayesian statistics, where our prior beliefs are transformed by new evidence, allowing us to gain deeper insights into the world. Despite the unexpected twists and turns, Bayesian inference empowers us to blend prior knowledge with fresh data, creating a rich tapestry of understanding. By embracing the spirit of Bayesian principles, we open doors to the exciting potential of statistics and embark on a captivating journey to unravel the complexities of our reality.

What’s even more fascinating is how closely the Bayesian inference process aligns with our natural way of learning and growing. Just as we integrate prior knowledge, weigh new evidence, and embrace uncertainty in our daily lives, Bayesian inference mirrors our own cognitive processes. It’s a dynamic dance of assimilating information, refining our understanding, and accepting the inherent uncertainties of life. This synergy between Bayesian inference and our natural curiosity has been a driving force behind the rise and success of Bayesian statistics in both theory and practice.

As Gelman and Shalizi (2013) eloquently state, “A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics.”

However, it’s important to acknowledge that while advanced statistical tools offer incredible possibilities, they also come with their own set of limitations. To make the most of these tools, we need to understand their boundaries and make informed choices about their applications.

From past to future

Imagine a time not too long ago when Bayesian statistics was not as prevalent as it is today. Computational challenges posed significant hurdles, limiting our ability to fully embrace its potential. But thanks to the rapid advancement of computing power and simulation techniques, the statistical landscape has undergone a revolution. We now find ourselves in an exciting era where complex Bayesian analysis is far more accessible. It’s like having a superpower in the palm of our hands, an empowering time where our statistical prowess can thrive and conquer new frontiers.

As passionate self-learners on this thrilling statistical journey, even without a formal statistician’s hat, we can’t help but feel an overwhelming excitement to share the vast potential of these tools for unraveling real-world phenomena. Delving into the world of statistics, especially through the lens of Bayesian inference, opens up a universe of captivating possibilities. By melding prior knowledge with fresh evidence and embracing the enigmatic realm of uncertainty, we can uncover profound insights into health, well-being, and the wondrous phenomena that shape our lives.

So, fellow adventurers, let’s ignite our curiosity, embrace our thirst for knowledge, and embark on this exhilarating voyage together. With statistics as our compass, we will navigate the complexities of our reality, expanding our understanding and seizing the extraordinary opportunities that await us.

Get ready to experience a world that’s more vivid, more nuanced, and more awe-inspiring than ever before. Together, let’s dive into the captivating realm of statistics, fueled by enthusiasm and a passion for discovery.

References

Gelman, Andrew, and Cosma Rohilla Shalizi. 2013. “Philosophy and the Practice of Bayesian Statistics.” British Journal of Mathematical and Statistical Psychology 66 (1): 8–38.

Citation

BibTeX citation:

@misc{castillo-aguilar2023,
author = {Castillo-Aguilar, Matías},
title = {Welcome to {Bayesically} {Speaking}},
date = {2023-06-10},
url = {https://bayesically-speaking.com//posts/2023-05-30 welcome},
doi = {10.59350/35tc8-qyj10},
langid = {en}
}