Let’s be honest, being a statistician or a scientist **isn’t just about crunching numbers all day**. It’s more like being a detective, a problem solver, and yeah, throwing in some math for good measure. When we get into **non-linear models**, things really start to get interesting. We’re not just drawing straight lines anymore; we’re wrestling with curves, untangling complicated relationships, and trying to figure out **what’s really going on behind the scenes**.

Take pharmacokinetics, for example. Sounds fancy, right? But at the core, it’s just about what happens to a drug inside the body, and for us, that’s where the real statistical fun begins. Predicting how a **drug’s concentration changes** in the bloodstream over time isn’t just about plotting some points and calling it a day. It’s about **understanding how the data dances with biology**, and then figuring out the best way to describe that dance. And let’s not forget the little thrill of choosing your weapon: Frequentist or Bayesian? It’s like deciding between coffee or mate (and let’s be real, I’m down for both). Each one has its perks, but the choice depends on the situation, and maybe how you’re feeling that day.

In this post, we’re going to roll up our sleeves and dig into **building a non-linear model** to predict how something, let’s say, a drug disappears from the bloodstream. But we’re not stopping there. Nope, we’re also going to throw in a showdown between two big players: the frequentist and Bayesian methods. Think of it like a friendly face-off between two old rivals, each with its own style, strengths, and die-hard fans.

But here’s the thing: this **isn’t just about which method wins on paper**. It’s about the real-life, day-to-day grind of working with data. It’s about those moments when you’re staring at your screen, trying to make sense of a stubborn parameter that just won’t cooperate. It’s about knowing when to trust the numbers and when to rely on your gut, your experience, and maybe even a bit of luck.

So whether you’re a seasoned pro who’s been around the block or someone just dipping their toes into the world of applied stats, this post is for you. We’re going to dive in, compare, and yeah, maybe even have a little fun along the way. Because in the world of science and stats, the journey is half the adventure.

Traditionally, when using frequentist tools for inference, we often focus on estimating a single effect of interest. For example, let’s consider **estimating the plasma concentrations** of the anti-inflammatory drug indomethacin **over several hours**. We observe the following behavior:

```
library(ggplot2)

## Indomethacin plasma levels over time, one color per subject
data(Indometh)
ggplot(Indometh, aes(time, conc, col = Subject)) +
  labs(y = "Plasma levels (mcg/ml)", x = "Time (hours)") +
  geom_point(cex = 3) +
  geom_smooth(se = FALSE, linewidth = 1)
```

Even though this is clearly a non-linear problem, we’ll first fit a **simple linear model** to the data, just to illustrate how it would look. Even for a simple linear model, we still need to write down the model we are fitting **to understand what it implies** about the relationship:

$$
\begin{aligned}
conc_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \beta_0 + \beta_1 \cdot time_i
\end{aligned}
$$

This is the standard linear model, in which we assume that, for each observation $i$, the indomethacin plasma level comes from a normal distribution with mean $\mu_i$ and standard deviation $\sigma$. The intercept $\beta_0$ is the expected plasma level when time is zero. The coefficient $\beta_1$ is the slope, that is, how much the plasma level of indomethacin changes with each passing hour, **assuming the change is linear and constant through time** (which, judging by the previous plot, it is not).

Every model is an approximate description of the data-generating process behind the phenomena we observe in real-world scenarios. That is the good thing about writing your models down as equations: it becomes clear **what data generation process you are assuming**, whether it is correct or not.

In addition, we need to be very thoughtful about the not-so-obvious implications of a model when trying to force the data to fit the model. That’s why we should always aim to **fit our models to the data and not the data to the models**. But, of course, that is harder, given that it requires us to think more carefully about the problems we intend to solve, instead of only reporting significant results in order to publish.

One thing you might have noticed in the previous model notation is that we have a mean $\mu_i$ for each observation but only one standard deviation $\sigma$ for the whole model. This implies that the dispersion must be equal across all expected plasma levels $\mu_i$, which can be more or less realistic depending on the type of problem you are working with.

Let’s look at the fit of the linear model:

```
lm_freq <- lm(conc ~ 1 + time, data = Indometh)
summary(lm_freq)
```

```
Call:
lm(formula = conc ~ 1 + time, data = Indometh)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5902 -0.2787 -0.1014  0.1579  1.6474

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.11820    0.08542  13.090  < 2e-16 ***
time        -0.18237    0.02258  -8.077 2.36e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4486 on 64 degrees of freedom
Multiple R-squared:  0.5048,    Adjusted R-squared:  0.497
F-statistic: 65.23 on 1 and 64 DF,  p-value: 2.361e-11
```

This linear model is telling us that the baseline plasma level of indomethacin at time 0 (the intercept, $\beta_0$) is 1.12 mcg/ml, which does not seem very realistic, especially given that most of the subjects initially had between 1.6 and 2.4 mcg/ml. The slope for time ($\beta_1$) indicates that for every additional hour there is a linear decrease of 0.18 mcg/ml in plasma levels of the drug. Let’s see how the model’s predictions look against the original data:

```
## Fitted values and their standard errors, bound into one matrix
lm_predict <- do.call(cbind, predict(lm_freq, se.fit = TRUE))

ggplot(cbind(Indometh, lm_predict), aes(time, conc)) +
  facet_wrap(~ Subject) +
  labs(y = "Plasma levels (mcg/ml)", x = "Time (hours)") +
  geom_point(aes(col = Subject), cex = 3) +
  geom_line(aes(y = fit, col = Subject), linewidth = 1) +
  ## Approximate 95% band around the fitted line
  geom_ribbon(aes(ymin = fit + se.fit * qnorm(0.025),
                  ymax = fit + se.fit * qnorm(0.975),
                  fill = Subject), alpha = .3)
```

Clearly, our linear model is not doing a good job capturing the non-linear nature of our data (as expected). Up next, we’ll try some non-linear modeling using a more thoughtful approach.
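
To see what the straight line implies beyond the fit statistics, here is a minimal sketch that plugs in the coefficients printed in the summary above. The names `b0`, `b1`, `linear_pred`, and `zero_crossing` are only for illustration and are not part of the original analysis.

```r
# Estimates copied from the lm() summary above
b0 <- 1.11820   # intercept: predicted conc at time 0 (mcg/ml)
b1 <- -0.18237  # slope: change in conc per hour (mcg/ml)

# Prediction of the straight-line model at any time (hours)
linear_pred <- function(time) b0 + b1 * time

# Time at which the line crosses zero concentration
zero_crossing <- -b0 / b1

linear_pred(c(0.25, 4, 8))  # predictions at early, middle and late times
zero_crossing               # hours until the line predicts 0 mcg/ml
```

Because the line keeps falling at the same pace, it crosses zero a little after six hours and predicts negative plasma levels by hour eight, which is not physically possible.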

It appears that the concentration of indomethacin decreases according to an **exponential function** of time. We can model this relationship with:

$$
\begin{aligned}
conc_i &\sim \mathcal{N}(\mu_i, \sigma) \\
\mu_i &= \alpha + \beta \cdot \exp(\lambda \cdot time_i)
\end{aligned}
$$

In this model, $\lambda$ represents the rate of decay in the expected plasma concentration of indomethacin ($\mu_i$), starting from a baseline concentration ($\beta$). The term $\alpha$ estimates the minimum plasma concentration level over the observed time frame. We assume that the variance ($\sigma^2$) is constant across expected indomethacin plasma concentrations ($\mu_i$) for each subject ($i$).

However, given that we have repeated measurements from multiple individuals, we need to account for variability in the initial plasma concentrations across subjects. To address this, we introduce a random effect ($\phi$) into the model:

$$
\mu_i = \alpha + (\beta + \phi_i) \cdot \exp(\lambda \cdot time_i)
$$

Here, $\phi_i$ represents the deviation from the baseline plasma level ($\beta$) for subject $i$. We keep $\alpha$ and $\lambda$ fixed because we’re interested in estimating population parameters to describe the pharmacokinetics of the drug.
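
To get a feel for how the pieces of this equation interact, here is a small sketch of the mean function in base R. The helper `decay_mean()` and the parameter values are illustrative assumptions, not estimates from the data.

```r
# Expected plasma level under the exponential decay model;
# phi = 0 corresponds to the population-average subject
decay_mean <- function(time, alpha, beta, lambda, phi = 0) {
  alpha + (beta + phi) * exp(lambda * time)
}

# Illustrative values only, not estimates from the data
time <- c(0, 1, 2, 8)
average_subject <- decay_mean(time, alpha = 0.1, beta = 2.5, lambda = -1.5)
higher_subject  <- decay_mean(time, alpha = 0.1, beta = 2.5, lambda = -1.5, phi = 0.5)

round(cbind(time, average_subject, higher_subject), 3)
```

With a negative `lambda`, both curves shrink toward `alpha` as time grows, while `phi` changes how high each subject starts above that floor.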

`nlme` approach

Let’s dive into fitting a non-linear model using the `nlme` R package, which allows us to specify custom model forms:

```
nlm_freq <- nlme::nlme(
  ## Model previously described
  model = conc ~ alpha + (beta + phi) * exp(lambda * time),
  data = Indometh,
  ## Fixed effects
  fixed = alpha + beta + lambda ~ 1,
  ## Random effects
  random = phi ~ 1 | Subject,
  ## Starting proposal values (in the order alpha, beta, lambda)
  start = list(fixed = c(0, 2, -1)))

## Confidence intervals for fixed effects
nlme::intervals(nlm_freq, which = "fixed")
```

```
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
alpha 0.09050133 0.1311638 0.1718262
beta 2.36887258 2.8249635 3.2810543
lambda -1.77509197 -1.6140483 -1.4530047
```

Let’s kick things off by looking at where we stand with indomethacin’s plasma levels. At time zero, the baseline parameter $\beta$ is estimated at about 2.82 mcg/ml. Not too shabby, right? But this little molecule doesn’t stick around for long: our decay rate clocks in at about $\lambda = -1.61$, which translates to a swift 80% drop in those levels each hour ($1 - e^{-1.61} \approx 0.80$). Eventually, the levels bottom out at around 0.13 mcg/ml ($\alpha$), where the decline takes a bit of a breather.
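
The step from the decay rate to a percent drop per hour is just arithmetic, and the following sketch spells it out. It uses the point estimate copied from the `nlme` output above; `lambda_hat`, `drop_per_hour`, and `half_life` are names made up for this illustration, and both quantities refer to the decaying part of the curve (the part above the floor).

```r
# Point estimate of the decay rate, copied from the nlme output above
lambda_hat <- -1.6140483

# Fraction of the decaying component lost per hour
drop_per_hour <- 1 - exp(lambda_hat)

# Hours needed for the decaying component to halve
half_life <- log(2) / abs(lambda_hat)

round(c(drop_per_hour = drop_per_hour, half_life = half_life), 3)
```

The drop comes out at roughly 80% per hour, and the half-life at a bit under half an hour, which is another way of saying the same thing.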

Now, if you were paying close attention to the code, you might have noticed something interesting: when we’re playing in the frequentist sandbox, particularly with non-linear models, **we’ve got to give the algorithm a little nudge with some starting values**. It’s like setting up the board for a game—these initial values are where the algorithm begins its quest through the likelihood landscape, hunting down the most likely parameters that explain our data. Remember that previous post about Hamiltonian Monte Carlo? Well, this is a bit like rolling a ball down a hill of parameter space, **but here we’re aiming to land on the single spot** that maximizes our chances of observing the data we have.
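
The role of starting values can be sketched without `nlme`. What follows is a minimal illustration, assuming simulated data and base R’s `optim()` with a least-squares criterion rather than the mixed-model likelihood that `nlme` actually maximizes. The objects `sim_time`, `sim_conc`, and `sse` are hypothetical.

```r
set.seed(123)

# Simulated data from a known decay curve (values chosen for illustration)
sim_time <- seq(0.25, 8, length.out = 40)
sim_conc <- 0.13 + 2.8 * exp(-1.6 * sim_time) + rnorm(40, mean = 0, sd = 0.05)

# Sum of squared errors for a candidate (alpha, beta, lambda)
sse <- function(par) {
  pred <- par[1] + par[2] * exp(par[3] * sim_time)
  sum((sim_conc - pred)^2)
}

# The optimizer starts its search from these proposal values
start_values <- c(alpha = 0, beta = 2, lambda = -1)
fit <- optim(par = start_values, fn = sse)

round(fit$par, 2)  # should land near the generating values 0.13, 2.8 and -1.6
fit$convergence    # 0 means optim() reported convergence
```

Starting values that are far from anything plausible (say, a positive `lambda`) can send the search to a poor region, which is why a quick look at the data before picking them usually pays off.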

But enough with the theory, let’s dive back into our non-linear model and see how these predicted plasma levels measure up against the real-world data we’ve got in hand:

```
## Subject-level predictions from the non-linear mixed model
freq_pred <- predict(nlm_freq)

ggplot(cbind(Indometh, pred = freq_pred), aes(time, conc)) +
  facet_wrap(~ Subject) +
  labs(y = "Plasma levels (mcg/ml)", x = "Time (hours)") +
  geom_point(aes(col = Subject), cex = 3) +
  geom_line(aes(y = pred, col = Subject), linewidth = 1)
```

Our model does a decent job fitting the observed data. But what’s the story beyond standard errors? How do we quantify uncertainty in our model parameters? What’s the likelihood of observing a specific decay rate or baseline level? With frequentist methods, our insights are somewhat limited to point estimates and standard errors. We need a broader view.

`brms` approach

To harness the power of the Bayesian framework, we need to not only define our model but also incorporate prior beliefs about the parameters. Let’s revisit our parameters:

- $\alpha$: Minimum plasma levels of the drug.
- $\beta$: Baseline plasma levels at time zero.
- $\phi$: Subject-specific deviation from the population $\beta$.
- $\lambda$: The rate of exponential decay.

We’ll assign prior distributions based on prior knowledge and results from our frequentist model. Here’s the prior setup:

- For $\alpha$, we’ll use a normal distribution centered around 0.1 mcg/ml with a standard deviation of 0.5 mcg/ml. We’ll truncate this prior at zero, since negative plasma levels aren’t physically meaningful.
- For $\beta$, we’ll specify a normal distribution centered around 2.5 mcg/ml with a standard deviation of 3 mcg/ml to avoid overly restricting the parameter space. We’ll also set a lower bound of zero.
- For $\phi$, we’ll use a normal prior centered on 0.5 mcg/ml with a moderate standard deviation of 2 mcg/ml to capture variability around baseline levels.
- For $\lambda$, we’ll set a weakly informative prior centered around -1 with a standard deviation of 3. This reflects our expectation of a negative decay rate, with the upper bound fixed at zero to prevent increases in plasma levels.

The priors for our model parameters are:

$$
\begin{aligned}
\alpha &\sim \mathcal{N}^{+}(0.1,\ 0.5) \\
\beta &\sim \mathcal{N}^{+}(2.5,\ 3.0) \\
\phi &\sim \mathcal{N}(0.5,\ 2.0) \\
\lambda &\sim \mathcal{N}^{-}(-1.0,\ 3.0)
\end{aligned}
$$

Here the superscripts $+$ and $-$ mark normal distributions truncated to positive and negative values, respectively.
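
One way to get a feel for these priors before fitting anything is to draw from them. The sketch below does that in base R with an inverse-CDF sampler for truncated normals; `rtrunc_norm()` is a hypothetical helper written for this example, not a function from `brms`.

```r
set.seed(42)

# Draw from a normal distribution truncated to [lb, ub] via the inverse CDF
rtrunc_norm <- function(n, mean, sd, lb = -Inf, ub = Inf) {
  u <- runif(n, pnorm(lb, mean, sd), pnorm(ub, mean, sd))
  qnorm(u, mean, sd)
}

n_draws <- 1000
prior_draws <- data.frame(
  alpha  = rtrunc_norm(n_draws, 0.1, 0.5, lb = 0),   # floor level, >= 0
  beta   = rtrunc_norm(n_draws, 2.5, 3.0, lb = 0),   # baseline level, >= 0
  phi    = rtrunc_norm(n_draws, 0.5, 2.0),           # unbounded deviation
  lambda = rtrunc_norm(n_draws, -1.0, 3.0, ub = 0)   # decay rate, <= 0
)

# Ranges implied by each prior before seeing any data
sapply(prior_draws, range)
```

If the ranges look absurd for the problem at hand (for instance, baselines far above anything a plasma assay would report), that is a hint to tighten the priors.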

Now that we have what we need, we can fit our Bayesian non-linear model:

```
library(brms)

nlme_brms <- brm(
  ## Formula
  formula = bf(conc ~ alpha + (beta + phi) * exp(lambda * time),
               alpha + beta + lambda ~ 1, phi ~ 1 | Subject,
               nl = TRUE),
  data = Indometh,
  ## Priors
  prior = prior(normal(0.1, 0.5), nlpar = "alpha", lb = 0) +
    prior(normal(2.5, 3.0), nlpar = "beta", lb = 0) +
    prior(normal(0.5, 2.0), nlpar = "phi") +
    prior(normal(-1.0, 3.0), nlpar = "lambda", ub = 0),
  ## MCMC hyperparameters
  chains = 5, iter = 4000,
  warmup = 2000, cores = 5,
  ## More flexible exploration parameters
  control = list(adapt_delta = 0.99,
                 max_treedepth = 50),
  ## For reproducibility
  seed = 1234, file = "nlme_brms.RDS"
)

## Posterior summaries of the population-level effects
fixef(nlme_brms)
```

```
Estimate Est.Error Q2.5 Q97.5
alpha_Intercept 0.1346163 0.02173654 0.09162845 0.1768141
beta_Intercept 2.6519527 1.44843180 0.24992601 5.7345653
lambda_Intercept -1.6448947 0.09467393 -1.83556844 -1.4659436
phi_Intercept 0.2150963 1.44471628 -2.84981889 2.6512578
```

In this Bayesian model, we get not just point estimates but full distributions for each parameter. This approach allows us to explore the probable range of parameter values and answer probabilistic questions. But first, let’s see how well our Bayesian model fits the observed data:

```
## Posterior predictions with 95% intervals for each observation
brms_pred <- predict(nlme_brms)

ggplot(cbind(Indometh, brms_pred), aes(time, conc)) +
  facet_wrap(~ Subject) +
  labs(y = "Plasma levels (mcg/ml)", x = "Time (hours)") +
  geom_point(aes(col = Subject), cex = 3) +
  geom_line(aes(y = Estimate, col = Subject), linewidth = 1) +
  geom_ribbon(aes(ymin = Q2.5, ymax = Q97.5, fill = Subject), alpha = .3)
```

The Bayesian model’s fitted effects align nicely with the observed data, and the uncertainty around the expected plasma concentrations is well-represented. To explore the range of parameters compatible with our data, we can **plot the posterior distributions**:

`plot(nlme_brms, variable = "^b_", regex = TRUE)`

These plots reveal a spectrum of parameter values that fit the observed data. For the decay parameter $\lambda$, we can expect a 78% ($\lambda \approx -1.5$) to 83% ($\lambda \approx -1.8$) decrease in indomethacin plasma concentrations per hour.

We can further explore the posterior distributions. For example, transforming the $\lambda$ parameter into a percentage decay scale:

```
## Posterior draws of lambda, transformed to the fraction lost per hour
posterior_dist <- as_draws_df(nlme_brms, variable = "b_lambda_Intercept")
posterior_dist$prob_decay <- (1 - exp(posterior_dist$b_lambda_Intercept))

ggplot(posterior_dist, aes(x = prob_decay)) +
  tidybayes::stat_halfeye(fill = "lightblue") +
  labs(y = "Density", x = "Decay of plasma levels per hour (%)") +
  scale_x_continuous(labels = scales::label_percent(), n.breaks = 8)
```

This flexibility in the Bayesian framework allows us to interpret the decay rate in more intuitive terms, reflecting a range of plausible rates consistent with our data. We can now communicate the percent decay of indomethacin plasma levels in a more accessible manner, considering the variability captured by our model.
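
The same draws can answer direct probability questions. The sketch below uses stand-in draws from a normal distribution that roughly matches the posterior summary of $\lambda$ shown earlier (mean about -1.64, standard deviation about 0.09). That is an assumption for illustration only: with the fitted model you would use `posterior_dist$b_lambda_Intercept` in place of the hypothetical `lambda_draws`.

```r
set.seed(2024)

# Stand-in for posterior draws of lambda (NOT the real posterior)
lambda_draws <- rnorm(10000, mean = -1.645, sd = 0.095)

# Transform each draw to the fraction lost per hour
decay_draws <- 1 - exp(lambda_draws)

# Median and 95% interval on the percent-decay scale
decay_summary <- quantile(decay_draws, probs = c(0.025, 0.5, 0.975))
decay_summary

# Probability that the hourly decay exceeds 80%
prob_above_80 <- mean(decay_draws > 0.80)
prob_above_80
```

This kind of statement ("the probability that the drug loses more than 80% per hour") is the sort of thing a point estimate with a standard error cannot give us directly.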

Now that we’ve implemented linear, non-linear, and Bayesian non-linear models, it’s time to compare their performances. It’s important to remember that each model has its own set of performance metrics, which can make direct comparisons tricky. However, by calculating the root mean square error (RMSE), we can get a sense of the **average error each model makes when predicting plasma levels**. RMSE gives us a tangible measure of error on the same scale as our outcome (mcg/ml), helping us gauge how well each model is performing:

```
data.frame(
  lm   = performance::performance_rmse(lm_freq),
  nlme = performance::performance_rmse(nlm_freq),
  brms = performance::performance_rmse(nlme_brms)
)
```

```
lm nlme brms
1 0.4417804 0.1080546 0.1076928
```

Here, we can see that **both the frequentist and Bayesian non-linear models outperformed the simple linear model** by a wide margin. The lower RMSE values indicate a better overall fit. Interestingly, the Bayesian model edged out the frequentist model by a tiny margin, with a difference of just 0.00036 mcg/ml in RMSE (0.10805 vs. 0.10769). Given that the standard deviation of the plasma levels is 0.63 mcg/ml, this difference is practically negligible and unlikely to be meaningful in a real-world context.
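
For readers who want to see what the RMSE boils down to, here is a hand-rolled version in base R. The helper `rmse()` and the object `lm_check` are names introduced for this sketch; the model is the same simple linear fit used earlier, refitted on the built-in `Indometh` data.

```r
# Root mean square error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Refit the simple linear model on the built-in Indometh data
lm_check <- lm(conc ~ 1 + time, data = datasets::Indometh)

lm_rmse <- rmse(datasets::Indometh$conc, fitted(lm_check))
lm_rmse
```

This should agree with the `lm` column of the table above, and the same helper works with any vector of predictions, whichever model produced them.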

After fitting three different models (two of which were non-linear) it becomes apparent that model selection isn’t just a mechanical task; **it’s a deliberate and thoughtful process that requires us to deeply consider the scientific meaning behind the equations we use**. This is where the art of modeling comes into play. Simply plugging numbers into an algorithm can give you results, but those results are only as meaningful as the assumptions underpinning them.

When we lay down equations, we’re not just scribbling mathematical symbols; we’re making statements about how we believe the world works. In the case of indomethacin, **our model reflects the assumption that its plasma levels decay exponentially over time**, a reasonable assumption based on pharmacokinetic principles. But it’s also crucial to remember that every equation carries with it a set of assumptions, some explicit and others hidden beneath layers of mathematical complexity. Recognizing and challenging these assumptions is where the real insight lies.

Historically, the evolution of statistical modeling has been shaped by this very tension between complexity and clarity. Think back to the early days of regression analysis in the 19th century when pioneers like Francis Galton and Karl Pearson were laying the groundwork for modern statistics. They weren’t just crunching numbers—they were developing tools to describe the relationships they observed in the natural world. Their work was as much about understanding the data as it was about interpreting the equations they devised.

Fast forward to today, and the principles remain the same, even as our tools have become more sophisticated. The move from linear to non-linear models mirrors our growing understanding of the world’s complexities. We know now that relationships aren’t always straight lines; they can bend, twist, and curve in ways that linear models can’t capture. Yet, **with this power comes responsibility**. Non-linear models, while more flexible, **also require us to be more vigilant about the assumptions we’re making**.

As we’ve seen with our indomethacin example, the choice of model has a profound impact on the conclusions we draw. The frequentist and Bayesian approaches each bring their own perspectives, but **neither is inherently superior**. Instead, the best approach depends on the context of the problem and the nature of the data. In the end, the goal isn’t to find the “right” model but **to find a model that best captures the underlying reality we’re trying to understand**.

So, as we continue to develop and apply statistical models, let’s do so with a sense of curiosity and caution. Let’s appreciate the historical context that brought us here, and let’s always be aware of the assumptions we’re making—because in the world of statistics, those assumptions shape the stories we tell.

BibTeX citation:

```
@misc{castillo-aguilar2024,
author = {Castillo-Aguilar, Matías},
title = {Non-Linear Models: {Pharmacokinetics} and {Indomethacin}},
date = {2024-08-20},
url = {https://bayesically-speaking.com/posts/2024-08-05 non-linear-models part 1/},
doi = {10.59350/r62fn-8h720},
langid = {en}
}
```

For attribution, please cite this work as:

Castillo-Aguilar, Matías. 2024. “Non-Linear Models:
Pharmacokinetics and Indomethacin.” August 20, 2024. https://doi.org/10.59350/r62fn-8h720.

Hey there, fellow science enthusiasts and stats geeks! Welcome back to the wild world of Markov Chain Monte Carlo (MCMC) algorithms. This is **part two** of my series on the **powerhouse behind Bayesian Inference**. If you missed the first post, no worries! Just hop on over here and catch up before we dive deeper into the MCMC madness. Today, we’re exploring the notorious **Hamiltonian Monte Carlo (HMC)**, a special kind of MCMC algorithm that taps into the dynamics of **Hamiltonian mechanics**.

Hold up, did you say Hamiltonian mechanics? What in the world do mechanics and physics have to do with Bayesian stats? I get it, it sounds like a mashup of your wildest nightmares. But trust me, this algorithm sometimes feels **like running a physics simulation in a statistical playground**. Remember our chat from the last post? In Bayesian stats, **we’re all about estimating the shape of a parameter space**, aka the posterior distribution.

Picture this: You drop a tiny particle down a cliff, and it rolls naturally along the landscape’s curves and slopes. Easy, right? **Now, swap out the real-world terrain for a funky high-dimensional probability function**. That same little particle? It’s cruising through this wild statistical landscape like a boss, all thanks to the rules of **Hamiltonian mechanics**.

About the animation

The previous animation illustrates the **Hamiltonian dynamics of a particle traveling** through a two-dimensional parameter space. The code for this animation is borrowed from Chi Feng’s GitHub. You can find the original repository with corresponding code here: https://github.com/chi-feng/mcmc-demo

Let’s break down Hamiltonian dynamics **in terms of position and momentum** with a fun scenario: **Imagine you’re on a swing**. When you hit the highest point, you slow down, right? **Your momentum’s almost zero**. But here’s the kicker: You know **you’re about to pick up speed on the way down**, gaining momentum in the opposite direction. That moment when you’re at the top, almost motionless? That’s when you’re losing **kinetic energy** and gaining **potential energy**, thanks to gravity getting ready to pull you back down.

So, in this analogy, **when your kinetic energy** (think swing momentum) goes up, **your potential energy** (like being at the bottom of the swing) **goes down**. And vice versa! When your kinetic energy drops (like when you’re climbing back up), your potential energy shoots up, waiting for gravity to do its thing.

This **energy dance** is captured by the **Hamiltonian** ($H$), which sums up the **total energy in the system**. It’s the sum of kinetic energy ($K$) and potential energy ($U$):

$$
H(q, p) = K(p) + U(q)
$$

At its core, Hamiltonian Monte Carlo (HMC) borrows from **Hamiltonian dynamics**, a fancy term for the rules that govern **how physical systems evolve in phase space**. In Hamiltonian mechanics, a system’s all about its position ($q$) and momentum ($p$), and **their dance is choreographed by Hamilton’s equations**. Brace yourself, things are about to get a little mathy:

$$
\begin{aligned}
\frac{dq}{dt} &= \frac{\partial H}{\partial p} \\
\frac{dp}{dt} &= -\frac{\partial H}{\partial q}
\end{aligned}
$$

Okay, I know **Hamiltonian dynamics** can be a real **brain-buster**. Trust me, it took me a hot minute to wrap my head around it. But hey, I’ve got an analogy that might just make it click. Let’s revisit **our swing scenario** and picture a kid on a swing. **The swing’s angle** from the vertical ($q$) tells us **where the kid is**, and momentum ($p$) is **how fast** the swing’s moving.

Now, let’s break down those equations:

$$
\frac{dq}{dt} = \frac{\partial H}{\partial p}
$$

This one’s like peeking into the future to see **how the angle** ($q$) **changes over time**. And guess what? **It’s all about momentum** ($p$). The faster the swing’s going, the quicker it swings back and forth, simple as that!

Next up:

$$
\frac{dp}{dt} = -\frac{\partial H}{\partial q}
$$

Now, this beauty tells us **how momentum** ($p$) **changes over time**. It’s all about the energy game here, specifically **how the swing’s position** ($q$) **affects its momentum**. When the swing’s at the highest point, gravity’s pulling hardest, ready to send the kid back down.

So, picture this:

- The kid swings forward, so the angle ($q$) goes up, carried by the momentum ($p$), until bam: top of the swing.
- At the top, the swing’s momentarily still, but gravity’s pulling to send the kid flying back down. This is where the potential energy has piled up.
- Zoom! Back down it goes, picking up speed in the opposite direction, and so the potential energy is transferred into kinetic energy.

All the while, the Hamiltonian ($H$) is keeping tabs **on the swing’s total energy**, whether it’s zooming at the bottom (high kinetic energy $K$, as a function of momentum $p$) or pausing at the top (high potential energy $U$, as a function of position $q$).

This **dance between kinetic and potential energy** is what we care about in **Hamiltonian mechanics**, and it’s also what we mean when we refer to the *phase space*, which is nothing more than **the relationship between position and momentum**.
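
Hamilton’s equations can be checked numerically for a toy system. The sketch below assumes a unit-mass oscillator with spring constant 1 (an illustrative choice) and approximates the partial derivatives of $H$ with central finite differences; the function names are made up for this example.

```r
# Energies of a unit-mass oscillator with spring constant k = 1 (illustrative)
kinetic   <- function(p, m = 1) p^2 / (2 * m)
potential <- function(q, k = 1) 0.5 * k * q^2
H <- function(q, p) kinetic(p) + potential(q)

# Central finite differences approximate the partial derivatives of H
eps <- 1e-6
dH_dp <- function(q, p) (H(q, p + eps) - H(q, p - eps)) / (2 * eps)
dH_dq <- function(q, p) (H(q + eps, p) - H(q - eps, p)) / (2 * eps)

q0 <- -3; p0 <- 0.5  # an arbitrary point in phase space

rates <- c(dq_dt = dH_dp(q0, p0),    # Hamilton: dq/dt =  dH/dp, here p / m
           dp_dt = -dH_dq(q0, p0))   # Hamilton: dp/dt = -dH/dq, here -k * q
rates
```

At this point the position is negative, so the momentum is being pushed upward (a positive `dp_dt`), which is the restoring pull dragging the mass back toward the center.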

Okay, I know we’re diving into some **physics territory** here **in a stats blog**, but trust me, understanding these concepts **is key to unlocking what HMC’s all about**. So, let’s take a little side trip and get a feel for Hamilton’s equations with a different example. **Check out the gif below** — see that weight on a string? It’s doing this cool back-and-forth dance thanks to the tug-of-war between the string pulling up and gravity pulling down.

Now, let’s get a little hands-on with some code. We’re gonna simulate **a simple harmonic oscillator** — you know, like that weight on a string — and **watch how it moves through phase space**.

```
# Define the potential energy function (U) and its derivative (dU/dq)
U <- function(q) {
  k <- 1 # Spring constant
  return(0.5 * k * q^2)
}

dU_dq <- function(q) {
  k <- 1 # Spring constant
  return(k * q)
}

# Kinetic energy (K) used for later
K <- function(p, m) {
  return(p^2 / (2 * m))
}

# Introduce a damping coefficient
b <- 0.1 # Damping coefficient

# Set up initial conditions
q <- -3.0 # Initial position
p <- 0.0  # Initial momentum
m <- 1.0  # Mass

# Time parameters
t_max <- 20
dt <- 0.1
num_steps <- ceiling(t_max / dt) # Ensure num_steps is an integer

# Initialize arrays to store position and momentum values over time
q_values <- numeric(num_steps)
p_values <- numeric(num_steps)

# Perform time integration using the leapfrog method
for (i in 1:num_steps) {
  # Store the current values
  q_values[i] <- q
  p_values[i] <- p

  # Half step update for momentum with damping
  p_half_step <- p - 0.5 * dt * (dU_dq(q) + b * p / m)

  # Full step update for position using the momentum from the half step
  q <- q + dt * (p_half_step / m)

  # Another half step update for momentum with damping using the new position
  p <- p_half_step - 0.5 * dt * (dU_dq(q) + b * p_half_step / m)
}
```

```
library(data.table)
library(ggplot2)

harmonic_data <- data.table(`Position (q)` = q_values, `Momentum (p)` = p_values)

ggplot(harmonic_data, aes(`Position (q)`, `Momentum (p)`)) +
  geom_path(linewidth = .7) +
  geom_point(size = 2) +
  # Highlight the starting point in red
  geom_point(data = harmonic_data[1,], col = "red", size = 3) +
  labs(title = "Phase Space Trajectory of Hamiltonian Dynamics",
       subtitle = "Accounting for Energy Loss") +
  theme_classic(base_size = 20)
```

Now, take a look at that graphic. See how the **position** ($q$) is all about **where the oscillator’s** hanging out, and **momentum** ($p$)? Well, that’s just **how fast the weight’s swinging**. Put them together, and you’ve got what we call the *phase space*. Basically, it’s like peeking into the dance floor of these mechanical systems through the lens of Hamiltonian dynamics.

Now, in a perfect world, there’d be **no energy lost over time**. But hey, we like to keep it real, so we added a little something called a **damping effect**: think of it **like energy leaking out of the system over time**. In the real world, that makes sense, but **in our statistical playground, we want to keep that energy** locked in tight. After all, losing energy means we’re losing precious info about our target distribution, and nobody wants that.

```
hamiltonian <- harmonic_data[, list(`Total energy` = U(`Position (q)`) + K(`Momentum (p)`, m),
                                    `Kinetic energy` = K(`Momentum (p)`, m),
                                    `Potential energy` = U(`Position (q)`))]
hamiltonian <- melt(hamiltonian, measure.vars = c("Total energy", "Kinetic energy", "Potential energy"))

# The time index repeats once per energy type after melting
ggplot(hamiltonian, aes(rep(seq_len(num_steps), times = 3), value, col = variable)) +
  geom_line(linewidth = 1) +
  labs(y = "Energy", col = "Variable", x = "Time",
       title = "Fluctuation of the Total Energy in the Oscillator",
       subtitle = "As a Function of Kinetic and Potential Energy") +
  scale_color_brewer(type = "qual", palette = 2) +
  theme_classic(base_size = 20) +
  theme(legend.position = "top")
```
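
To see what "keeping the energy locked in" looks like, here is a minimal sketch of the same leapfrog scheme with the damping switched off. It assumes unit mass and a spring constant of 1, and the function `leapfrog_energy()` is a hypothetical helper written for this example so that it does not overwrite the objects defined above.

```r
# Leapfrog integration of an undamped oscillator (unit mass, k = 1),
# returning the total energy recorded at the start of every step
leapfrog_energy <- function(q, p, dt, n_steps) {
  grad_U <- function(q) q                      # dU/dq for U = 0.5 * q^2
  energy <- numeric(n_steps)
  for (i in seq_len(n_steps)) {
    energy[i] <- 0.5 * q^2 + 0.5 * p^2         # potential + kinetic
    p_half <- p - 0.5 * dt * grad_U(q)         # half step for momentum
    q <- q + dt * p_half                       # full step for position
    p <- p_half - 0.5 * dt * grad_U(q)         # second half step for momentum
  }
  energy
}

energy_free <- leapfrog_energy(q = -3, p = 0, dt = 0.1, n_steps = 200)

# Largest relative deviation from the starting energy
max_rel_dev <- max(abs(energy_free - energy_free[1])) / energy_free[1]
max_rel_dev
```

Without damping, the total energy only wobbles by a fraction of a percent around its starting value instead of draining away, which is the behavior HMC relies on.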

So, what’s the big takeaway here? Well, whether it’s a **ball rolling down a hill** or a sampler hunting for **model coefficients**, this framework’s got us covered. In Bayesian land, think of **our model’s parameters as position coordinates** $q$ in some space, and $p$ is the momentum helping us navigate the twists and turns of this parameter space. And with Hamiltonian dynamics leading the way, we’re guaranteed to find our path through this statistical dimension, one step at a time.

Now that we’ve got some intuition about Hamiltonian mechanics, it’s time we build our own HMC sampler. To accomplish this, imagine that we’re diving into some data to figure out **how an independent variable** ($x$) **and a dependent variable** ($y$) are related. We’re talking about **linear regression**, you know, trying to draw a line through those scattered data points to make sense of it all. But hey, **this is Bayesian territory**, so we’re not just throwing any ol’ line on that plot. No, sir! We’re exploring the parameter space of those regression coefficients, that’s the slope and intercept. Then, when the data rolls in, we’re smashing together that likelihood and those priors **using Bayes’ theorem** to cook up a posterior distribution, **a fancy way of saying** our updated beliefs **about those coefficients** after we’ve seen the data.

But before we dive into the statistical kitchen, **let’s whip up some synthetic data**. Picture this: we’re mimicking a real-world scenario where relationships between variables are as murky as a foggy morning. So, we’re gonna **conjure up a batch of data** with a simple linear relationship, jazzed up with a sprinkle of noise. Oh, and **let’s keep it small** — just 20 subjects, ’cause, hey, **science “loves” a manageable sample size**.

```
set.seed(80) # Set seed for reproducibility
# Define the number of data points and the range of independent variable 'x'
n <- 20
x <- seq(1, 10, length.out = n)
```

Okay, so now let’s get down to business and fit ourselves a nice, cozy **linear relationship** between an independent variable ($x$) and a dependent variable ($y$). We’re talking about laying down **a straight line** that best describes how $y$ changes with $x$.

So, what’s our equation look like? Well, it’s pretty simple:

$$
\begin{aligned}
y_i &\sim \mathcal{N}(\mu_i, \sigma^2) \\
\mu_i &= \beta_0 + \beta_1 \cdot x_i
\end{aligned}
$$

Hold on, let’s break it down. We’re saying that each $y$ value ($y_i$) is chillin’ around a mean ($\mu_i$), like a bunch of friends at a party. And guess what? **They’re all acting like good ol’ normal folks**, hanging out with a variance ($\sigma^2$) that tells us **how spread out they are**. Now, the cool part is how we define $\mu_i$. It’s **just a simple sum** of an intercept ($\beta_0$) and the slope ($\beta_1$) times $x_i$. Think of it like plotting points on graph paper: each $x_i$ tells us where we are on the $x$-axis, and multiplying by $\beta_1$ (and adding $\beta_0$) **gives us the corresponding height on the line**.

Now, let’s talk numbers. We’re **setting** $\beta_0$ **to 2** because, hey, every relationship needs a starting point, right? And for $\beta_1$, **we’re going with 3**: **that’s the rate of change** we’re expecting for every unit increase in $x$. Oh, and let’s not forget about $\sigma$, which is just a fancy way of saying how much our $y_i$ values **are allowed to wiggle around**.

```
# Define the true parameters of the linear model and the noise level
true_intercept <- 2
true_slope <- 3
sigma <- 5
```

With our model all set up, **it’s time to create some data points** for our $y$ variable. We’ll do this using the `rnorm()` function, which is like a magical data generator for **normally distributed variables**.

```
# Expected value of 'y' at each 'x' under the true parameters
mu_i <- true_intercept + true_slope * x
# Generate the dependent variable 'y' with noise
y <- rnorm(n, mu_i, sigma)
```

Alright, now that we’ve got our hands on some data, it’s time to dive into the **nitty-gritty of Bayesian stuff**. First up, we’re gonna need our trusty **log likelihood function** for our **linear model**. This function’s like the Sherlock Holmes of statistics — it figures out **the probability of seeing our data** given a specific set of parameters (you know, the intercept and slope we’re trying to estimate).

```
# Define the log likelihood function for linear regression
log_likelihood <- function(intercept, slope, x, y, sigma) {
  # We estimate the predicted response
  y_pred <- intercept + slope * x

  # Then we see how far from the observed value we are
  residuals <- y - y_pred

  # Then we estimate the likelihood associated with that error from a distribution
  # with no error (mean = 0)
  # (this is the function that we are trying to maximize)
  log_likelihood <- sum( dnorm(residuals, mean = 0, sd = sigma, log = TRUE) )

  return(log_likelihood)
}
```
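As a quick sanity check, here’s a minimal, self-contained sketch of that function in action. The seed, `n`, and the way `x` is drawn are assumptions made only for this illustration, not necessarily the values behind the figures in this post. It shows that the log likelihood is higher at the true parameters than at a deliberately bad guess, and that the least squares fit from `lm()` sits at the top of the hill.

```r
set.seed(42)   # assumed seed, only for reproducibility
n <- 100       # assumed sample size
x <- rnorm(n)  # assumed predictor values
sigma <- 5
y <- rnorm(n, mean = 2 + 3 * x, sd = sigma)

log_likelihood <- function(intercept, slope, x, y, sigma) {
  residuals <- y - (intercept + slope * x)
  sum(dnorm(residuals, mean = 0, sd = sigma, log = TRUE))
}

ll_true <- log_likelihood(2, 3, x, y, sigma)     # at the generating values
ll_far  <- log_likelihood(-10, 10, x, y, sigma)  # at a deliberately bad guess

# With sigma held fixed, maximizing the likelihood is the same as minimizing
# the squared error, so the lm() coefficients give the highest log likelihood.
ols <- coef(lm(y ~ x))
ll_ols <- log_likelihood(ols[1], ols[2], x, y, sigma)

c(true = ll_true, far = ll_far, ols = ll_ols)
```

The further a pair of parameters is from the data, the more negative the log likelihood gets, and that is exactly the “terrain” our sampler will be walking on.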

So, what’s the deal with **priors**? Well, think of them as the background music to our data party. **They’re like our initial hunches** about what the parameters could be before we’ve even glanced at the data. To keep things simple, **we’ll go with flat priors** — no favoritism towards any particular values. It’s like saying, “Hey, let’s give everyone a fair shot!”

```
# Define the log prior function for the parameters
log_prior <- function(intercept, slope) {
  # Assuming flat priors for simplicity
  # (a constant density; the log of 1 is 0, so it has no effect)
  return(0)
}
```

Now, here’s where the real magic kicks in. **We bring our likelihood and priors together** in a beautiful dance to reveal the superstar of Bayesian statistics: **the posterior distribution!** This bad boy tells us everything we wanna know **after we’ve taken the data into account**. This allows us to combine previous knowledge, like past research, with the observed data, say, samples from a new experiment.

The posterior is nothing more than the **probability associated with our parameters** of interest (aka the slope and intercept), **given the observed data**. We represent this posterior distribution as the following:

$$
\text{Posterior} \propto \text{Likelihood} \times \text{Prior}
$$

Which **is the same** as saying:

$$
\log(\text{Posterior}) = \log(\text{Likelihood}) + \log(\text{Prior}) + \text{constant}
$$

```
# Combine log likelihood and log prior to get log posterior
log_posterior <- function(intercept, slope, x, y, sigma) {
  return(log_likelihood(intercept, slope, x, y, sigma) + log_prior(intercept, slope))
}
```

Alright, now that we have everything ready, let’s dig into the guts of HMC and how we put it to work. Remember how HMC **takes inspiration from the momentum and potential energy** dance in physics? Well, in practice, it’s like having a GPS for our parameter space, guiding us to new spots that are more likely than others.

But here’s the thing: **our parameter space isn’t some smooth highway** we can cruise along endlessly. Nope, it’s more like a rugged terrain full of twists and turns. So, how do we navigate this space? Enter the **leapfrog integration method**, the backbone of HMC.

So, **leapfrog integration** is basically this cool math trick we use in HMC **to play out how a system moves over time** using discrete steps, **so we don’t have to compute every single** value along the way. This integration method has another advantage: it barely lets any energy leak out of the system (remember the Hamiltonian?), which is super important if you don’t want to get stuck in this statistical dimension mid-exploration.

Here’s how it works in HMC-lingo: we use leapfrog integration **to move around the parameter space in discrete steps**, rather than sliding through a continuum, **grabbing samples from the posterior distribution** along the way. The whole process goes like this:

- We give the **momentum** a little nudge by leveraging the **gradient info** ($\nabla U(\theta)$, the gradient of the potential energy $U$). The **gradient or slope** at the current position ($\theta$) in this parameter space will **determine by how much our momentum will change**. It’s like being at the top of a swing: the potential energy then transfers to kinetic energy (aka momentum).
- We adjust the **position** (or parameters) based on the **momentum** boost.
- Then we **update the momentum** based on the gradient at that **new position**.
- We repeat the cycle for as many “jumps” as we are doing, for each sample of the posterior we intend to draw.

**Picture a frog hopping from one lily pad to another**: that’s where the name “leapfrog” comes from. It helps us **explore new spots** in the parameter space by **discretizing the motion** of this imaginary particle under Hamiltonian dynamics, using the slope information ($\nabla U(\theta)$) at the current position to gain or lose momentum and move to another position in the parameter space.

We prefer leapfrogging over simpler methods like Euler’s method because it **keeps errors low**, both locally and globally. This stability is key, especially when we’re dealing with big, complicated systems. Plus, **it’s a champ at handling high-dimensional spaces**, where keeping energy in check is a must for the algorithm to converge.
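To make that claim a bit more tangible, here’s a small sketch comparing both integrators on a toy one-dimensional target (a standard normal, so the potential energy is $q^2/2$). The function names, the starting point, and the step size are assumptions chosen just for this illustration. The total energy should stay at 0.5 if the simulation were perfect.

```r
grad_U <- function(q) q                      # gradient of U(q) = q^2 / 2
energy <- function(q, p) q^2 / 2 + p^2 / 2   # potential + kinetic energy

run_integrator <- function(method, q = 1, p = 0, epsilon = 0.1, steps = 1000) {
  H <- numeric(steps)
  for (s in seq_len(steps)) {
    if (method == "leapfrog") {
      p <- p - 0.5 * epsilon * grad_U(q)  # half step for momentum
      q <- q + epsilon * p                # full step for position
      p <- p - 0.5 * epsilon * grad_U(q)  # second half step for momentum
    } else {
      # Plain (explicit) Euler: both updates use the old values
      q_new <- q + epsilon * p
      p <- p - epsilon * grad_U(q)
      q <- q_new
    }
    H[s] <- energy(q, p)
  }
  H
}

H_leapfrog <- run_integrator("leapfrog")
H_euler <- run_integrator("euler")

range(H_leapfrog)  # stays very close to the initial energy of 0.5
tail(H_euler, 1)   # keeps growing step after step
```

With Euler the particle keeps gaining energy it never had, so it spirals away from the region we care about, while leapfrog only wobbles slightly around the true energy.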

Now, to get our HMC sampler purring like a kitten, we’ve got to **fine-tune a few gears**. Think of these as the knobs and dials on your favorite sound system – adjust them just right, and you’ll be **grooving to the perfect beat**.

First up, we’ve got the **number of samples**. This determines how many times our sampler will take a peek at the parameter space before calling it a day.

Next, we’ve got the **step size** ($\epsilon$). Imagine this as the stride length for our leapfrog integrator. Too short, and we’ll be tiptoeing; too long, and we’ll be taking giant leaps, neither of which gets us where we want to go. It’s all about finding that sweet spot.

Then, there’s the **number of steps for the leapfrog** to make. Too few, and we risk missing key spots; too many, and we might tire ourselves out.

Lastly, we need **an initial guess for the intercept and slope**. This is like dropping a pin on a map – it gives our sampler a starting point to begin its journey through the parameter space.

```
# Initialization of the sampler
num_samples <- 5000 # Number of samples
epsilon <- 0.05 # Leapfrog step size
num_steps <- 50 # Number of leapfrog steps
init_intercept <- 0 # Initial guess for intercept
init_slope <- 0 # Initial guess for slope
# Placeholder for storing samples
params_samples <- matrix(NA, nrow = num_samples, ncol = 2)
```

Alright, let’s fire up this Hamiltonian engine and get this party started. Here’s the game plan:

- **Create a Loop:** We’ll set up a loop to simulate our little particle moving around the parameter space.
- **Integrate Its Motion:** Using our trusty leapfrog integrator, we’ll keep track of how our particle moves.
- **Grab a Sample:** Once our particle has finished its dance, we’ll grab a sample at its final position.
- **Accept or Reject:** We’ll play a little game of accept or reject: if the new position looks promising, we’ll keep it; if not, we’ll stick with the old one. It’s like Tinder for parameters.
- **Repeat:** We’ll rinse and repeat this process until we’ve collected as many samples as we need.

Now, **to kick things off**, we’ll give our imaginary particle **a random speed and direction** to start with, and drop it down somewhere in the parameter space. This initial kick sets the stage for our parameter exploration, the rest is up to the physics.

```
for (i in 1:num_samples) {
# Start with a random momentum of the particle
momentum_current <- rnorm(2)
# And set the initial position of the particle
params_proposed <-c(init_intercept, init_slope)
# Next, we will simulate the particle's motion using
# leapfrog integration.
...
```

Now that our imaginary particle is all geared up with an **initial momentum and position**, it’s time to let it loose and see where it goes in this **parameter playground**. We know we’re using Hamiltonian mechanics, but to make it computationally feasible, we’re bringing in our trusty leapfrog integrator. We’ve already seen that this bad boy **discretizes the motion of our particle**, making it manageable to track its journey without breaking our computers.

So, here’s the lowdown on what our leapfrog integrator is up to:

**Estimating the Slope:** We start off with an initial position and estimate the slope of the terrain at that point.

**Adjusting Momentum:** This slope comes from the potential energy surface, dictating whether our particle speeds up or slows down. So, we tweak the momentum accordingly based on this slope.

**Taking a Step:** With this momentum tweak, we move the particle by its momentum scaled by the step size $\epsilon$, updating its position from the starting point.

**Repeat:** Now that our particle has a new spot, we rinse and repeat, estimating the slope and adjusting momentum.

**Keep Going:** We keep this cycle going for as many steps as we want to simulate, tracking our particle’s journey through the parameter space.

```
for (j in 1:num_steps) {
# Gradient estimation
grad <- c(
# Gradient of the intercept of X: -sum(residuals / variance)
...
# Gradient of the slope of X: -sum(residuals * X / variance)
...
)
# Momentum half update
momentum_current <- momentum_current - epsilon * grad * 0.5
# Full step update for parameters
params_proposed <- params_proposed + epsilon * momentum_current
# Recalculate gradient for another half step update for momentum
grad <- c(
# Gradient of the intercept of X: -sum(residuals / variance)
...
# Gradient of the slope of X: -sum(residuals * X / variance)
...
)
# Final half momentum update
momentum_current <- momentum_current - epsilon * grad * 0.5
}
```
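If you’re wondering where the two gradient expressions in those comments come from, they are the derivatives of the negative log posterior (flat priors, fixed `sigma`) with respect to the intercept and the slope. Here’s a hedged little check that compares them against finite differences. The data, seed, and helper names (`U`, `grad_U`, `num_grad`) are assumptions made only for this sketch.

```r
set.seed(1)  # assumed seed and toy data, only for this check
n <- 50
x <- rnorm(n)
sigma <- 5
y <- rnorm(n, mean = 2 + 3 * x, sd = sigma)

# Negative log posterior under flat priors (the "potential energy")
U <- function(par) -sum(dnorm(y, mean = par[1] + par[2] * x, sd = sigma, log = TRUE))

# Analytic gradient, matching the comments in the leapfrog pseudocode
grad_U <- function(par) {
  res <- y - (par[1] + par[2] * x)
  c(-sum(res / sigma^2), -sum(res * x / sigma^2))
}

# Central finite differences as an independent check
num_grad <- function(f, par, h = 1e-5) {
  sapply(seq_along(par), function(k) {
    step <- replace(numeric(length(par)), k, h)
    (f(par + step) - f(par - step)) / (2 * h)
  })
}

par0 <- c(0.5, -1)  # arbitrary point in parameter space
rbind(analytic = grad_U(par0), numeric = num_grad(U, par0))
```

Both rows should agree to several decimal places, which is a cheap way to catch a sign error before it sends the particle uphill instead of downhill.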

After our particle has taken its fair share of steps through the parameter space, it’s decision time – **should we accept or reject its proposed new position?** We can’t just blindly accept every move it makes; we’ve got to be smart about it.

That’s where the **Metropolis acceptance criteria** come into play. This handy rule determines whether a proposed new position **is a good fit or not**. The idea is to weigh the probability of the new position against the probability of the current one. **If the new spot is at least as probable as the current one, we always move there; if it’s less probable, we still move there with a certain probability**, ensuring that our samples accurately reflect the shape of the distribution we’re exploring. Otherwise, we’ll stick with where we are.

The **formula** for this acceptance probability ($\alpha$) when transitioning from the current position ($\theta$) to a proposed position ($\theta^*$) is straightforward:

$$
\alpha = \min\left(1, \frac{P(\theta^*)}{P(\theta)}\right)
$$

Here, $P(\theta^*)$ is the probability density of the **proposed position** $\theta^*$, and $P(\theta)$ is the probability density of the **current position** $\theta$. We’re essentially **comparing the fitness of the proposed spot against where we’re currently at**. If the proposed position offers a higher probability density, we always accept it; if it offers a lower one, we accept it with probability equal to the ratio. This ensures that our samples accurately represent the target distribution.

However, when dealing with very small probability values, **we might run into numerical underflow issues**. That’s where using the **log posterior** probabilities comes in handy. By taking the logarithm of the probabilities, we convert the ratio into a difference, making it **easier to manage**. Here’s how the acceptance criteria look with logarithms:

$$
\alpha = \min\left(1, \exp\left(\log P(\theta^*) - \log P(\theta)\right)\right)
$$

**This formulation is equivalent to the previous one** but helps us avoid numerical headaches, especially when working with complex or high-dimensional data. We’re still comparing the fitness of the proposed position with our current spot, just in a more **log-friendly way**.

```
# Calculate log posteriors and acceptance probability
log_posterior_current <- log_posterior( ...current parameters... )
log_posterior_proposed <- log_posterior( ...proposed parameters... )
alpha <- min(1, exp(log_posterior_proposed - log_posterior_current))
# Accept or reject the proposal
if (runif(1) < alpha) {
init_intercept <- params_proposed[1]
init_slope <- params_proposed[2]
}
```
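One caveat worth flagging: the snippet above compares log posteriors only. In the textbook formulation of HMC, the accept step compares the full Hamiltonian, that is, potential plus kinetic energy, at the start and at the end of the trajectory. Here’s a minimal sketch of that variant. The helper `hmc_log_accept_ratio()` and its argument names are hypothetical and not part of the code in this post, and it assumes standard normal momentum, as drawn with `rnorm()`.

```r
hmc_log_accept_ratio <- function(log_post_current, log_post_proposed,
                                 momentum_start, momentum_end) {
  # Kinetic energy for standard normal momentum: sum(p^2) / 2
  kinetic_start <- sum(momentum_start^2) / 2
  kinetic_end   <- sum(momentum_end^2) / 2
  # H = -log posterior + kinetic energy; this returns H_start - H_end
  (log_post_proposed - kinetic_end) - (log_post_current - kinetic_start)
}

# Toy numbers: the log posterior went up by 1, but the particle also picked
# up exactly that much kinetic energy, so the total energy is unchanged
log_ratio <- hmc_log_accept_ratio(
  log_post_current = -10, log_post_proposed = -9,
  momentum_start = c(0, 0), momentum_end = c(1, 1)
)
alpha <- min(1, exp(log_ratio))
alpha
```

To plug something like this into the loop, you’d also need to stash the momentum drawn at the start of each iteration before the leapfrog steps overwrite it.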

Now that we’ve broken down each piece of our HMC puzzle, it’s time **to put them all together** and see how the full algorithm works.

```
for (i in 1:num_samples) {
  # Randomly initialize momentum
  momentum_current <- rnorm(2)
  # Make a copy of the current parameters
  params_proposed <- c(init_intercept, init_slope)
  # Perform leapfrog integration
  for (j in 1:num_steps) {
    # Half step update for momentum
    grad <- c(
      -sum((y - (params_proposed[1] + params_proposed[2] * x)) / sigma^2),
      -sum((y - (params_proposed[1] + params_proposed[2] * x)) * x / sigma^2)
    )
    momentum_current <- momentum_current - epsilon * grad * 0.5
    # Full step update for parameters
    params_proposed <- params_proposed + epsilon * momentum_current
    # Recalculate gradient for another half step update for momentum
    grad <- c(
      -sum((y - (params_proposed[1] + params_proposed[2] * x)) / sigma^2),
      -sum((y - (params_proposed[1] + params_proposed[2] * x)) * x / sigma^2)
    )
    momentum_current <- momentum_current - epsilon * grad * 0.5
  }
  # Calculate the log posterior of the current and proposed parameters
  log_posterior_current <- log_posterior(init_intercept, init_slope, x, y, sigma)
  log_posterior_proposed <- log_posterior(params_proposed[1], params_proposed[2], x, y, sigma)
  # Calculate the acceptance probability
  alpha <- min(1, exp(log_posterior_proposed - log_posterior_current))
  # Accept or reject the proposal
  if (runif(1) < alpha) {
    init_intercept <- params_proposed[1]
    init_slope <- params_proposed[2]
  }
  # Save the sample
  params_samples[i, ] <- c(init_intercept, init_slope)
}
```

Alright, folks, let’s wrap this up with a peek at how our little algorithm fared in estimating the true intercept and slope of our linear model.

```
colnames(params_samples) <- c("Intercept", "Slope")
posterior <- as.data.table(params_samples)
posterior[, sample := seq_len(.N)]
melt_posterior <- melt(posterior, id.vars = "sample")
ggplot(melt_posterior, aes(sample, value, col = variable)) +
facet_grid(rows = vars(variable), scales = "free_y") +
geom_line(show.legend = FALSE) +
geom_hline(data = data.frame(
hline = c(true_intercept, true_slope),
variable = c("Intercept", "Slope")
), aes(yintercept = hline), linetype = 2, linewidth = 1.5) +
labs(x = "Samples", y = NULL,
title = "Convergence of parameter values",
subtitle = "Traceplot of both the Intercept and Slope") +
scale_color_brewer(type = "qual", palette = 2) +
scale_y_continuous(n.breaks = 3) +
scale_x_continuous(expand = c(0,0)) +
theme_classic(base_size = 20) +
theme(legend.position = "top")
```

In this plot, we’re tracking the **evolution of the intercept and slope parameters over the course of our sampling process**. Each line traces one parameter across iterations, with every point along it being one sample from the posterior distribution, showing how these parameters fluctuate over time. The dashed lines mark the **true values** of the **intercept** and **slope** that we used to generate the data. Ideally, we’d like to see the samples converging around these true values, **indicating that our sampler is accurately capturing the underlying structure** of the data.

```
ids <- posterior[,sample(sample, size = 200)]
ggplot() +
geom_abline(slope = posterior$Slope[ids], intercept = posterior$Intercept[ids], col = "steelblue", alpha = .5) +
geom_abline(slope = true_slope, intercept = true_intercept, col = "white", lwd = 1.5) +
geom_point(aes(x, y), size = 4) +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
labs(title = "Data with Regression Line",
subtitle = "True and HMC-estimated parameter values") +
theme_classic(base_size = 20)
```

This plot gives us a bird’s-eye view of our data, overlaid with both the **true regression line** (in white) and the **estimated regression lines** from our HMC sampler (in blue). The true regression line represents the ground truth relationship between our independent and dependent variables, while **the estimated regression lines come from a random subset of 200 accepted values** of the intercept and slope in our posterior samples. By comparing these two, we can assess **how well our model has captured the underlying trend** in the data.

```
ggplot(posterior, aes(Slope, Intercept)) +
geom_density2d_filled(contour_var = "density", show.legend = FALSE) +
geom_hline(yintercept = true_intercept, linetype = 2, col = "white") +
geom_vline(xintercept = true_slope, linetype = 2, col = "white") +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0)) +
scale_fill_viridis_d(option = "B") +
theme_classic(base_size = 20)
```

In this plot, we’re visualizing the **joint posterior density** of the **intercept** and **slope** parameters sampled from our model. The contours represent regions of higher density, with **brighter zones indicating areas where more samples are concentrated**. The white dashed lines mark the true values of the intercept and slope used to generate the data, providing a reference for comparison. Ideally, we’d like to see **the contours align closely with these true values**, indicating that our sampler has accurately captured the underlying distribution of the parameters.
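Plots are great, but it’s often handy to boil the samples down to a few numbers too. Here’s a rough sketch of how that could look. Note that `draws` below is a synthetic stand-in (plain `rnorm()` draws), not the actual output of the sampler above, and that `summarise_chain()` and the warmup length of 500 are hypothetical choices made only for this illustration.

```r
set.seed(7)
# Synthetic stand-in for a matrix of posterior draws (NOT the sampler's output)
draws <- cbind(
  Intercept = rnorm(5000, mean = 2, sd = 0.5),
  Slope     = rnorm(5000, mean = 3, sd = 0.5)
)

summarise_chain <- function(draws, warmup = 500) {
  # Discard the first iterations, where the chain may still be wandering
  kept <- draws[-seq_len(warmup), , drop = FALSE]
  # Posterior mean and a 95% interval for each parameter (one row each)
  t(apply(kept, 2, function(v) {
    c(mean = mean(v), quantile(v, probs = c(0.025, 0.975)))
  }))
}

chain_summary <- summarise_chain(draws)
chain_summary
```

With the real `params_samples` matrix in place of `draws`, the same helper would give the posterior mean and interval to compare against the dashed reference lines.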

So, let’s take a moment to marvel at the marvels of Hamiltonian Monte Carlo (HMC). It’s not every day you see **physics rubbing shoulders with statistics**, but here we are, with HMC straddling both worlds like a boss.

What makes HMC so darn fascinating is how it **borrows tools from physics to tackle complex problems in statistics**. It’s like watching two old pals team up to solve a mystery, each bringing their own unique skills to the table. With HMC, we’re not just crunching numbers; we’re **tapping into the underlying principles that govern the physical world**. I mean, seriously, it gives me goosebumps just thinking about it.

But it’s not just HMC that’s shaking up the stats scene. Nope, the whole world of science is pivoting towards **simulation-based statistics** and **Bayesian methods** faster than you can say “p-value.” Why? Because in today’s data-rich landscape, traditional methods just can’t keep up. We need tools like HMC **to navigate the choppy waters of high-dimensional data**, to tease out the subtle patterns hiding in the noise.

Now, here’s the kicker: **understanding how HMC works isn’t just some academic exercise**. Oh no, it’s the key to unlocking the true power of Bayesian inference. Sure, **you can run your models without ever peeking under the hood**, but where’s the fun in that? Knowing how HMC works gives you this intuitive grasp of what your model is up to, what might be tripping it up, and how to **steer it back on course** when things go sideways.

So, here’s to HMC, one of the **big algorithms of modern statistics**, blurring the lines between physics and stats, and paving the way for a brave new world of **simulation-based inference**. Cheers to the leapfrogging pioneers of Bayesian stats, charting new territories and uncovering hidden truths, *sliding one simulated step at a time*.

BibTeX citation:

```
@misc{castillo-aguilar2024,
author = {Castillo-Aguilar, Matías},
title = {The {Good,} {The} {Bad,} and {Hamiltonian} {Monte} {Carlo}},
date = {2024-05-15},
url = {https://bayesically-speaking.com/posts/2024-05-15 mcmc part 2/},
doi = {10.59350/fa26y-xa178},
langid = {en}
}
```

For attribution, please cite this work as:

Castillo-Aguilar, Matías. 2024. “The Good, The Bad, and
Hamiltonian Monte Carlo.” May 15, 2024. https://doi.org/10.59350/fa26y-xa178.

Alright, folks, let’s dive into the wild world of statistics and data science! Picture this: you’re knee-deep in data, trying to make sense of the chaos. But here’s the kicker, sometimes the chaos is just **too darn complex**. With tons of variables flying around, getting a grip on uncertainty can feel like **trying to catch smoke with your bare hands**.

Keep in mind that the kind of problem we’re dealing with isn’t solely related to the number of dimensions; it’s mostly related to **trying to estimate something that we can’t see in full beforehand**. For instance, consider the following banana distribution (shown below). How could we map this simple two-dimensional surface without computing it all at once?

```
dbanana <- function(x) {
  a <- 2
  b <- 0.2
  # Quadratic curve that traces the banana shape
  (a * b) * (x^2 + a^2)
}
x <- seq(-6, 6, length.out = 300)
y = dbanana(x)
z <- MASS::kde2d(x, y, n = 100, lims = c(-10, 10, -2.6, 20))
plot_ly(x = z$x, y = z$y, z = sqrt(z$z)) |>
add_surface() |>
style(hoverinfo = "none")
```

You know when you hit a roadblock in your calculations, and you’re like, “*Can’t we just crunch the numbers for every single value?*” Well, let’s break it down. Picture a grid with $N$ points for each of $D$ dimensions. Now, brace yourself, ’cause the number of computations needed is like $N$ raised to the power of $D$.

So, let’s say you wanna estimate 100 points (to get a decent estimation of the shape) for each of 100 dimensions. That’s like slamming your head against **ten to the power of 200 computations**… that’s a hell of a lot of computations!

Sure, in la-la land, you could approximate every single number with some degree of precision. But let’s get real here, even **if you had all the time in the world**, you’d still be chipping away at those calculations **until the sun swallowed the Earth**, especially with continuous cases and tons of dimensions that are somewhat correlated (which in reality, **tends to be the case**).

This headache we’re dealing with? It’s what we “*affectionately*” call — emphasis on double quotes — the **curse of dimensionality**. It’s like trying to squeeze a square peg into a round hole… it ain’t gonna happen without a supersized hammer!

```
curse_dimensionality <- data.frame(dimensions = factor((1:10)^2),
calculations = 100^((1:10)^2))
ggplot(curse_dimensionality, aes(dimensions, calculations)) +
geom_col(fill = ggsci::pal_jama()(1)) +
scale_y_continuous(transform = "log10", n.breaks = 9,
labels = scales::label_log(), expand = c(0,0,.1,0)) +
labs(y = "Computations (log-scale)", x = "Dimensions (Variables)",
title = "Computations needed to compute a grid of 100 points",
subtitle = "As a function of dimensions/variables involved") +
theme_classic(base_size = 20)
```

Explaining the curse of dimensionality further

Imagine you’re trying to create a grid **to map out the probability space for a set of variables**. As the number of dimensions increases, the number of grid points needed to adequately represent the space explodes exponentially. This means that even with the most powerful computers, it becomes **practically impossible** to compute all the probabilities accurately.
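To see that explosion on a scale your laptop can still handle, here’s a tiny sketch that actually builds the grid for one to five dimensions, using only 10 points per dimension (both numbers are assumptions picked to keep it fast).

```r
points_per_dim <- 10

grid_size <- sapply(1:5, function(d) {
  axes <- rep(list(seq_len(points_per_dim)), d)  # one axis per dimension
  nrow(expand.grid(axes))                        # rows = grid points to evaluate
})
grid_size

# Same count computed directly: points per dimension ^ number of dimensions
points_per_dim^(1:5)
```

Every extra dimension multiplies the work by ten here, and by a hundred in the 100-point grid from the plot above.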

Now, if we can’t crack the problem **analytically** (which, let’s face it, is the case **most of the time**), we gotta get creative. Lucky for us, there’s a bunch of algorithms that can lend a hand by sampling this high-dimensional parameter space. Enter the **Markov Chain Monte Carlo (MCMC)** family of algorithms.

But hold up — Markov Chain Monte What? Yeah, it’s a mouthful, but bear with me. You’re probably wondering how this fancy-schmancy term is connected to exploring high-dimensional probability spaces. Well, I’ll let you in on the secret sauce behind these concepts and why **they’re the go-to tools** in top-notch probabilistic software like Stan.

But before we get into the nitty-gritty of MCMC, let’s take a detour and talk about **Markov Chains**, because they’re like the OGs of this whole MCMC gang.

Consider the following scenario: if today is rainy, the probability that tomorrow will be rainy again is 60%, but if today is sunny, the probability that tomorrow will be rainy is only 30%. However, the probability of tomorrow being sunny is 40% if today is raining, but 70% if today is sunny as well.

As you can see, **the probability of a future step depends on the current step**. This logic is central to Bayesian inference, as it allows us to talk about the conditional probability of a future value based on a previous one, like sampling across a continuous variable.

Now, let’s imagine letting time run and fast forward a year. If we keep an eye on the weather every day, we’ll notice something interesting: the **relative frequencies** of rainy and sunny days **start to settle into a rhythm**, each one **tending to converge** to a single number. This steady state is what we call a stationary distribution. It’s like the true probability of what the weather’s gonna be like in the long run, taking into account all the different scenarios.

```
simulate_weather <- function(total_time) {
  weather <- vector("character", total_time) # Create slots for each day
  day <- 1 # First day
  weather[day] <- sample(c("Rainy", "Sunny"), size = 1) # Weather for first day
  while (day < total_time) {
    day <- day + 1 # Add one more day
    # The new day's weather depends on the previous day's weather
    if (weather[day - 1] == "Rainy") {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.6, .4))
    } else {
      weather[day] <- sample(c("Rainy", "Sunny"), size = 1, prob = c(.3, .7))
    }
  }
  return(weather)
}
sim_time <- 365*1
weather <- simulate_weather(total_time = sim_time)
weather_data <- data.frame(
prop = c(cumsum(weather == "Rainy") / seq_len(sim_time), cumsum(weather == "Sunny") / seq_len(sim_time)),
time = c(seq_len(sim_time), seq_len(sim_time)),
weather = c(rep("Rainy", times = sim_time), rep("Sunny", times = sim_time))
)
ggplot(weather_data, aes(time, prop, fill = weather)) +
geom_area() +
scale_y_continuous(labels = scales::label_percent(), n.breaks = 6,
name = "Proportion of each weather", expand = c(0,0)) +
scale_x_continuous(name = "Days", n.breaks = 10, expand = c(0,0)) +
scale_fill_brewer(type = "qual", palette = 3) +
labs(fill = "Weather", title = "Convergence to stationary distribution",
subtitle = "Based on cumulative proportion of each Sunny or Rainy days") +
theme_classic(base_size = 20)
```

This heuristic allows us to naturally converge to an answer **without needing to solve it analytically**, which tends to be useful for really complex and high-dimensional problems.

Sure, we could’ve crunched the numbers ourselves to figure out these probabilities. But why bother with all that math when we can let time do its thing and **naturally converge to the same answer?** Especially when we’re dealing with **complex problems** that could have given even Einstein himself a headache.
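For this tiny two-state example the math is actually cheap, so here’s a short sketch of what “crunching the numbers ourselves” could look like: we write the transition probabilities as a matrix and apply them over and over to a starting guess. The object names and the 100 repetitions are just choices for this illustration.

```r
# Transition matrix: rows = today's weather, columns = tomorrow's weather
P <- matrix(c(0.6, 0.4,
              0.3, 0.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("Rainy", "Sunny"), c("Rainy", "Sunny")))

# Start fully certain that it's rainy, then apply the transition rule repeatedly
state <- c(Rainy = 1, Sunny = 0)
for (day in 1:100) state <- state %*% P

stationary <- drop(state)
stationary  # long-run share of rainy and sunny days (3/7 and 4/7)
```

So in the long run we should expect roughly 43% rainy days and 57% sunny days, no matter which weather we started with.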

Explaining the convergence process further

The idea of convergence to a stationary distribution can be likened to taking a random walk through the space of possible outcomes. Over time, the relative frequencies of each outcome stabilize, giving us a reliable estimate of the true probabilities.

As we’ve seen, sometimes it becomes impractical to solve analytically or even approximate the posterior distribution using a grid, given the number of calculations needed to even get a decent approximation of the posterior.

However, we’ve also seen that **Markov Chains** might offer us a way to compute **complex conditional probabilities** and, if we let them run long enough, they will eventually converge to the stationary distribution, which could resemble the posterior distribution itself. So, all things considered, when does the Monte Carlo part come in?

Alright, let’s break down the magic of Monte Carlo methods in plain English. Picture this: in the wacky world of random events, being able to **sample from a distribution** is like having a crystal ball to predict the future — pretty nifty, right?

Now, imagine we’re sampling from a normal probability density, say, with a mean of 80 and a standard deviation of 5. **We grab a random sample** of 10 folks, calculate their average weight, **and repeat this process** a thousand times.

In the following figure, we overlay a histogram of the sample means, one from each simulated sample, on the population distribution from which we are sampling. As you can see, this opens up an interesting opportunity: using this Monte Carlo simulation, **we can get an intuition of how likely** it is for our sample of 10 individuals to have a mean outside the range of 75 to 85. It’s not impossible, but it’s unlikely.

```
mean_weights <- matrix(data = rnorm(10 * 1000, 80, 5), nrow = 10, ncol = 1000) |>
colMeans()
cols <- ggsci::pal_jama()(2)
ggplot() +
stat_function(fun = ~dnorm(.x, 80, 5), xlim = c(60, 100), geom = "area", fill = cols[2]) +
geom_histogram(aes(x = mean_weights, y = after_stat(density)/3.75),
fill = cols[1], col = cols[2], binwidth = .4) +
scale_y_continuous(expand = c(0,0), name = "Density", labels = NULL, breaks = NULL) +
scale_x_continuous(expand = c(0,0), name = "Weight (kg)", n.breaks = 10) +
geom_curve(data = data.frame(
x = c(70), xend = c(74.5), y = c(0.061), yend = c(0.05)
), aes(x = x, xend = xend, y = y, yend = yend),
curvature = -.2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
geom_curve(data = data.frame(
x = c(90), xend = c(82), y = c(0.061), yend = c(0.05)
), aes(x = x, xend = xend, y = y, yend = yend),
curvature = .2, arrow = arrow(length = unit(0.1, "inches"), type = "closed")) +
geom_text(aes(x = c(67, 94), y = c(0.0605), label = c("Population\ndistribution", "Means of each\nsimulated sample")), size = 6) +
theme_classic(base_size = 20)
```

With each sample, we’re capturing the randomness of the population’s weight distribution. And hey, it’s not just about weight; **we can simulate all sorts of wild scenarios**, from multi-variable mayhem to linear model lunacy. This is the heart and soul of Monte Carlo methods: taking random shots in the dark to mimic complex processes.

But here’s the kicker: **the more samples we take, the clearer the picture becomes**. For instance, if we take ten times the amount of samples used, we would get **a better intuition about the uncertainty around the expectation** for each sample of 10 individuals, which could have important applications in the design of experiments and hypothesis testing.
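As a hedged sketch of that idea, we can put an actual number on “unlikely” by cranking up the number of simulated samples and comparing against the exact answer, which is available in this simple normal case. The seed and the 100,000 replicates are arbitrary choices for this illustration.

```r
set.seed(2024)    # assumed seed, only for reproducibility
n_sims <- 100000  # number of simulated samples of 10 individuals

# Each column is one sample of 10 weights; colMeans() gives one mean per sample
sample_means <- colMeans(matrix(rnorm(10 * n_sims, mean = 80, sd = 5), nrow = 10))

# Monte Carlo estimate: share of sample means falling outside 75 to 85
mc_estimate <- mean(sample_means < 75 | sample_means > 85)

# Exact answer: the mean of 10 draws is Normal(80, 5 / sqrt(10))
exact <- 2 * pnorm(75, mean = 80, sd = 5 / sqrt(10))

c(monte_carlo = mc_estimate, exact = exact)
```

Both numbers land at a fraction of one percent, and the simulated one gets closer to the exact value as `n_sims` grows.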

And that’s where Monte Carlo methods shine. By generating a boatload of samples, we can unravel the mysteries of even the trickiest distributions, no crystal ball required. It’s a game changer for exploring the unknown without needing a PhD in rocket science.

Explaining the importance of Monte Carlo further

Monte Carlo methods provide a powerful tool for **approximating complex distributions** by sampling from them. By generating a large number of samples, we can gain insight into the shape and properties of the distribution without needing to explicitly calculate all possible outcomes.

Alright, let’s break down the basics of MCMC. Picture this: you’ve got these two heavyweights in the world of statistics, Markov Chains and Monte Carlo methods.

On one side, you’ve got Markov Chains. These bad boys **help us predict the probability of something happening based on what happened before**. It’s like saying, “Hey, if it rained yesterday, what’s the chance it’ll rain again today?”

Then, there are Monte Carlo methods. These puppies **work by randomly sampling from a distribution** to get an idea of what the whole shebang looks like. It’s like throwing a bunch of darts at a dartboard in the dark and hoping you hit the bullseye.

However the question remains, how do they team up to tackle real-world problems?

In essence, **MCMC is an algorithm that generates random samples from a proposal distribution**. These samples are accepted or rejected based on how much more likely the proposed sample is compared to the previous accepted sample.

In this way, the proposed samples are **accepted in the same proportion as the actual probability in the target distribution**, accepting more samples that are more likely and fewer samples that are less likely.

The fascinating nature of this heuristic is that it works to approximate complex distributions without needing to know much about the shape of the final distribution.

So, think of it as trekking through this complex landscape, **taking random steps** (the Monte Carlo part) but **guided by the likelihood of each move**, given where you currently stand (the Markov Chain part). It’s a meticulous journey, but one that ultimately leads us to a better understanding of these elusive distributions.

For instance, consider that we have a distribution (shown below) that we can’t compute in full, because **it would take too long to integrate** the whole function. This will be our target distribution, from which we can only compute the density of **one value at a time**.

About the target distribution

In practice, we would derive the target distribution from the data and prior information, which enables us to estimate the density in a **point-wise manner**, without the need to estimate the whole PDF all at once. But for the sake of demonstration we will use the **Gamma probability density function**.

However, please consider that a single distribution family can’t always describe a probability density perfectly; sometimes it takes a mixture of distributions, truncation, or censoring. **It all comes down to the underlying process that generates the data that we are trying to mimic**.

```
# Target distribution that we in practice would derive from
# the data.
target_dist <- function(i) dgamma(i, shape = 2, scale = 1)
ggplot() +
stat_function(fun = target_dist,
xlim = c(0, 11), geom = "area",
fill = "#374E55FF") +
scale_y_continuous(breaks = NULL, name = "Density", expand = c(0,0)) +
scale_x_continuous(name = "Some scale", expand = c(0,0)) +
theme_classic(base_size = 20)
```

The next thing to do is to specify a proposal distribution, from which we’ll generate proposals for the next step. To this end, we’ll be using a Normal density function with $\mu = 0$ and $\sigma = 1$.

```
# This is a function that will generate proposals for the next step.
proprosal <- function() rnorm(1, mean = 0, sd = 1)
```

And set some algorithm parameters that are necessary for our MCMC to run:

```
## Algorithm parameters ----
total_steps <- 1000 # Total number of steps
step <- 1 # We start at step 1
value <- 10 # Set an initial starting value
```

Finally, we run our algorithm as explained in previous sections. Try to follow the code to get an intuition of what it is doing.

```
## Algorithm ----
set.seed(1234) # Seed for reproducibility
while(step < total_steps) {
# Increase for next step
step <- step + 1
## 1. Propose a new value ----
# Proposal of the next step is ...
value[step] <-
# the previous step plus...
value[step - 1L] +
# a change in a random direction (based on the
# proposal distribution)
proprosal()
## 2. We see if the new value is more or less likely ----
# How likely (in the target distribution)
likelihood <-
# is the proposed value compared to the previous step
target_dist(value[step]) / target_dist(value[step - 1L])
## 3. Based on its likelihood, we accept or reject it ----
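# If the proposed value is at least as likely as the previous one
# (ratio >= 1), we always keep it. If it is less likely, we keep it
# with probability equal to the ratio; otherwise we stay where we were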
if (likelihood < runif(1))
value[step] <- value[step - 1L]
# Then we repeat for the next step
}
```

Finally, let’s explore how well our algorithm converges to the target distribution.

```
mcmc <- data.frame(
step = seq_len(step),
value = value
)
ggplot(mcmc, aes(x = step, y = value)) +
geom_line(col = "#374E55FF") +
ggside::geom_ysidehistogram(aes(x = -after_stat(count)), fill = "#374E55FF", binwidth = .3) +
ggside::geom_ysidedensity(aes(x = -after_stat(count)*.35), col = "#374E55FF") +
ggside::scale_ysidex_continuous(expand = c(0,0,0,.1), breaks = NULL) +
scale_x_continuous(expand = c(0,0), name = "Step") +
scale_y_continuous(name = NULL, position = "right") +
labs(title = "Trace of MCMC values to target distribution",
subtitle = "Evolution of values at each step") +
theme_classic(base_size = 20) +
ggside::ggside(y.pos = "left") +
theme(ggside.panel.scale = .4)
```

Another thing we care about is **how well our MCMC is performing**. After all, if it performs poorly, what would be the point of using it in the first place? To check this, **we’ll compare the expectation** ($E(X)$) of the target distribution **against the one estimated from our MCMC samples**.

For this, we have to consider that the expectation, $E(X)$, of any Gamma distribution is equal to the shape parameter ($\alpha$) times the scale parameter ($\theta$). We can express this as follows:

$$
E(X) = \alpha \theta, \quad \text{with } X \sim \text{Gamma}(\alpha, \theta)
$$

In our case, $\alpha = 2$ and $\theta = 1$, so $E(X) = 2$, which is the solid red reference line in the plot below.

```
ggplot(mcmc, aes(x = step, y = cumsum(value)/step)) +
geom_line(col = "#374E55FF") +
scale_x_continuous(expand = c(0,.1), name = "Steps (log-scale)",
transform = "log10", labels = scales::label_log()) +
scale_y_continuous(name = NULL, expand = c(0, 1)) +
labs(title = "Convergence to location parameter",
subtitle = "Cumulative mean across steps") +
geom_hline(aes(yintercept = 2), col = "darkred") +
geom_hline(aes(yintercept = mean(value)), lty = 2) +
annotate(x = 1.5, xend = 1.1, y = 7.5, yend = 9.5, geom = "curve", curvature = -.2,
arrow = arrow(length = unit(.1, "in"), type = "closed")) +
annotate(x = 2, y = 6.8, label = "Initial value", size = 5, geom = "text") +
annotate(x = (10^2.5), xend = (10^2.6), y = 5, yend = 2.5, geom = "curve", curvature = .2,
arrow = arrow(length = unit(.1, "in"), type = "closed")) +
annotate(x = (10^2.5), y = 5.8, label = "Convergence", size = 5, geom = "text") +
theme_classic(base_size = 20)
```
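If you’d rather check this with numbers than with a plot, here is a minimal, self-contained sketch of the same random-walk idea wrapped in a helper function. The helper `run_metropolis()`, its `proposal_sd` argument, the chain length and the 1,000-step burn-in are all choices made for this illustration, not part of the code above.

```r
# Random-walk Metropolis wrapped in a (hypothetical) helper function
run_metropolis <- function(target, n_steps, init, proposal_sd = 1) {
  chain <- numeric(n_steps)
  chain[1] <- init
  for (s in 2:n_steps) {
    # Propose a move around the current value
    proposed <- chain[s - 1] + rnorm(1, mean = 0, sd = proposal_sd)
    # Ratio of target densities: proposed vs. current
    ratio <- target(proposed) / target(chain[s - 1])
    # Accept with probability min(1, ratio), otherwise stay put
    chain[s] <- if (runif(1) < ratio) proposed else chain[s - 1]
  }
  chain
}

gamma_target <- function(x) dgamma(x, shape = 2, scale = 1)

set.seed(1234)
chain <- run_metropolis(gamma_target, n_steps = 20000, init = 10)

# Drop the first 1,000 steps (an arbitrary burn-in) so the far-off
# starting value does not drag the average
kept <- chain[-seq_len(1000)]
c(mcmc_mean = mean(kept), theoretical = 2 * 1)
```

Dropping the early steps is a common way to reduce the influence of a poorly chosen starting value like 10, at the cost of throwing away some samples.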

This general process is central to MCMC, but more specifically to the **Metropolis-Hastings** algorithm. However, and in order to broaden our understanding, let’s explore additional MCMC algorithms beyond the basic Metropolis-Hastings with some simple examples.

Imagine you’re at a buffet with stations offering various cuisines — Italian, Chinese, Mexican — you name it. **You’re on a mission to create a plate with a bit of everything**, but here’s the catch: you can only visit one station at a time. Here’s how you tackle it:

- Hit up a station and **randomly pick a dish**.
- **Move on to the next station** and repeat the process.
- Keep going until you’ve got a plateful of diverse flavors.

Gibbs sampling works kind of like this buffet adventure. **You take turns sampling from conditional distributions**, just like you visit each station for a dish. Each time, you focus on one variable, **updating its value while keeping the others constant**. It’s like building your plate by sampling from each cuisine until you’ve got the perfect mix.
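Here is a minimal sketch of that station-by-station idea, assuming a standard bivariate Normal target with correlation `rho`, a case where both conditional distributions are known Normals. The function `gibbs_bvn()` and its arguments are hypothetical names for this example.

```r
# Gibbs sampler for a standard bivariate Normal with correlation rho.
# Each conditional is Normal: x | y ~ N(rho * y, 1 - rho^2), and vice versa.
gibbs_bvn <- function(n_iter, rho, init = c(0, 0)) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = 2,
                  dimnames = list(NULL, c("x", "y")))
  x <- init[1]
  y <- init[2]
  cond_sd <- sqrt(1 - rho^2)
  for (i in seq_len(n_iter)) {
    x <- rnorm(1, mean = rho * y, sd = cond_sd)  # update x, holding y fixed
    y <- rnorm(1, mean = rho * x, sd = cond_sd)  # update y, holding x fixed
    draws[i, ] <- c(x, y)
  }
  draws
}

set.seed(2024)
draws <- gibbs_bvn(n_iter = 10000, rho = 0.6)
cor(draws[, "x"], draws[, "y"])  # should land close to rho
```

Notice there is no accept/reject step: every draw comes straight from a conditional distribution, so every draw is kept.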

Picture yourself hiking up a rugged mountain with rocky trails and valleys. Your goal? **Reach the summit without breaking a sweat — or falling off a cliff**. So, you whip out your map and binoculars to plan your route:

- Study the map to plot a path with **minimal uphill battles and maximum flat stretches**.
- Use the binoculars to **scout ahead and avoid obstacles** along the way.
- **Adjust your route as you go**, smoothly navigating the terrain like a seasoned pro.

Hamiltonian Monte Carlo (HMC) is a bit like this hiking adventure. It simulates a particle moving through a high-dimensional space, **using gradient info to find the smoothest path**. Instead of blindly wandering, HMC leverages the curvature of the target distribution to explore efficiently. It’s like hiking with a GPS that guides you around the rough spots and straight to the summit.
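To give a flavor of what this looks like in code, below is a bare-bones HMC sketch for a one-dimensional standard Normal target, using the usual leapfrog integrator. It is a toy under stated assumptions, not how Stan or any other library implements it: the target, the names `hmc_step()`, `step_size` and `n_leapfrog`, and their default values are all chosen for illustration.

```r
# Toy HMC for a 1-D standard Normal target
log_target <- function(q) -0.5 * q^2       # log density, up to a constant
grad_log_target <- function(q) -q          # gradient of the log density

hmc_step <- function(q, step_size = 0.1, n_leapfrog = 15) {
  p <- rnorm(1)                            # fresh random momentum
  q_new <- q
  # Leapfrog: half step for momentum, then alternate full steps
  p_new <- p + 0.5 * step_size * grad_log_target(q_new)
  for (l in seq_len(n_leapfrog)) {
    q_new <- q_new + step_size * p_new
    if (l < n_leapfrog) {
      p_new <- p_new + step_size * grad_log_target(q_new)
    }
  }
  # Final half step for momentum
  p_new <- p_new + 0.5 * step_size * grad_log_target(q_new)
  # Total "energy" = potential (-log target) + kinetic (p^2 / 2)
  h_old <- -log_target(q) + 0.5 * p^2
  h_new <- -log_target(q_new) + 0.5 * p_new^2
  # Accept with probability min(1, exp(h_old - h_new))
  if (log(runif(1)) < h_old - h_new) q_new else q
}

set.seed(99)
samples <- numeric(5000)
for (i in 2:length(samples)) samples[i] <- hmc_step(samples[i - 1])
c(mean = mean(samples), sd = sd(samples))
```

Because the trajectory follows the gradient, each proposal can land far from the current value and still be accepted most of the time, which is where HMC gets its efficiency. The price is the tuning of `step_size` and `n_leapfrog`.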

Now that you’ve dipped your toes into the MCMC pool, it’s time to talk turkey — well, sampling. Each MCMC method has its perks and quirks, and knowing them is half the battle.

**Gibbs sampling** is the laid-back surfer dude of the group — simple, chill, and **great for models with structured dependencies**. But throw in some highly correlated variables, and it starts to wobble like a rookie on a surfboard.

Meanwhile, **HMC** is the sleek Ferrari — efficient, powerful, and **perfect for tackling complex models** head-on. Just don’t forget to fine-tune those parameters, or you might end up spinning out on a sharp curve.

How they sample:

- **Metropolis-Hastings**: Takes random walks to generate samples, with acceptance based on a ratio of target distribution probabilities.
- **Gibbs Sampling**: Updates variables one by one based on conditional distributions, like a tag team wrestling match.
- **Hamiltonian Monte Carlo**: Glides through high-dimensional space using deterministic trajectories guided by Hamiltonian dynamics, like a graceful dancer in a crowded room.

How efficiently they explore:

- **Metropolis-Hastings**: Easy to implement but might struggle to explore efficiently, especially in high-dimensional spaces.
- **Gibbs Sampling**: Perfect for structured models but may stumble with highly correlated variables.
- **Hamiltonian Monte Carlo**: Efficiently navigates high-dimensional spaces, leading to faster convergence and smoother mixing.

How they accept proposals:

- **Metropolis-Hastings**: Decides whether to accept or reject proposals based on a ratio of target distribution probabilities.
- **Gibbs Sampling**: Skips the acceptance drama and generates samples directly from conditional distributions.
- **Hamiltonian Monte Carlo**: Judges proposals based on the joint energy of position and momentum variables, like a strict dance instructor.

How much tuning they need:

- **Metropolis-Hastings**: Requires tweaking the proposal distribution but keeps it simple.
- **Gibbs Sampling**: A breeze to implement, but watch out for those conditional distributions: they can be sneaky.
- **Hamiltonian Monte Carlo**: Needs tuning of parameters like step size and trajectory length, and the implementation might get a bit hairy with momentum variables and gradient computation.

In the following, you can see **an interactive animation** of different MCMC algorithms (MH, Gibbs and HMC) and **how they work** to uncover distributions in two dimensions. The code for this animation is borrowed from Chi Feng’s GitHub. You can find the original repository with the corresponding code here: https://github.com/chi-feng/mcmc-demo

Alright, theory’s cool and all, but let’s get down to brass tacks. When you’re rolling up your sleeves to implement MCMC algorithms, it’s like picking the right tool for the job. Simple models? **Metropolis-Hastings** or **Gibbs sampling** has your back. But when you’re wrangling with the big boys — those complex models — that’s when you call in **Hamiltonian Monte Carlo**. It’s like upgrading from a rusty old wrench to a shiny new power tool. And don’t forget about tuning those parameters — it’s like fine-tuning your car for a smooth ride.

Beyond all the technical jargon, successful Bayesian inference is part gut feeling, part detective work. Picking the right priors is like seasoning a dish — you want **just the right flavor without overpowering everything else**. And tuning those parameters? It’s like fine-tuning your favorite instrument to make sure the music hits all the right notes.

But hey, nothing worth doing is ever a walk in the park, right? MCMC might be the hero of the Bayesian world, but **it’s not without its challenges**. Scaling up to big data? It’s like trying to squeeze into those skinny jeans from high school — uncomfortable and a bit awkward. And exploring those complex parameter spaces? **It’s like navigating a maze blindfolded**.

But fear not! There’s always a light at the end of the tunnel. Recent innovations in Bayesian inference, like **variational inference** (we’ll tackle this cousin of MCMC in the next posts) and **probabilistic programming languages** (like Stan), are like shiny beacons, guiding us to new horizons.

These days, some probabilistic programming languages, **like Stan**, use a souped-up version of the Hamiltonian Monte Carlo algorithm with **hyperparameters tuned on the fly**. These tools are like magic wands that turn your ideas into reality: just specify your model and let the parameter space exploration happen **in the background**, no sweat.

As we wrap up our journey into the world of MCMC, let’s take a moment to appreciate the wild ride we’ve been on. MCMC might not wear a cape, but it’s the unsung hero behind so much of what we do in **Bayesian data analysis**. It’s the tool that lets us dive headfirst into the **murky waters of uncertainty** and come out the other side with **clarity and insight**.

In the next installment of our MCMC series, we’ll dive into the infamous **Hamiltonian Monte Carlo** and unravel the statistical wizardry behind this and other similar algorithms. Until then, happy sampling!

BibTeX citation:

```
@misc{castillo-aguilar2024,
author = {Castillo-Aguilar, Matías},
title = {Markov {Chain} {Monte} {What?}},
date = {2024-04-25},
url = {https://bayesically-speaking.com/posts/2024-04-14 mcmc part 1/},
doi = {10.59350/mxfyk-6av39},
langid = {en}
}
```

For attribution, please cite this work as:

Castillo-Aguilar, Matías. 2024. “Markov Chain Monte What?”
April 25, 2024. https://doi.org/10.59350/mxfyk-6av39.

First of all, welcome to the first post of “Bayesically Speaking” (which, in case you haven’t noticed, is a play on words between “Basically Speaking” and the (hopefully) well-known Bayes’ theorem). Although the website is offline at the time of writing this article, I find myself following the advice of all those people who encouraged me to trust my instinct and dare to do what I have always wanted: to convey the thrill of using science as a tool to know and understand the reality that surrounds us, which we perceive only in a limited way through our senses.

For years, my interests have revolved around understanding the world through the lens of statistics, particularly as a tool to better understand and quantify the relationships between the moving parts that make up many health outcomes. Another aspect that I find fascinating is how certain variables can go unnoticed when viewed separately, but when viewed together can have radically different behaviors.

```
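# Packages used in this chunk; simulate_simpson() comes from bayestestR
library(bayestestR)
library(data.table)
library(ggplot2)
sim_data <- simulate_simpson(n = 100,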
difference = 2,
groups = 4,
r = .7) |>
as.data.table()
sim_data[, Group := factor(Group,
levels = c("G_1","G_2","G_3","G_4"),
labels = c("Placebo", "Low dose", "Medium dose", "High dose"))]
ggplot(sim_data, aes(V1, V2, col = Group)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(method = "lm", aes(group = 1, col = NULL)) +
scale_color_brewer(type = "qual", palette = 2) +
labs(x = "Time exposure", y = expression(Delta*"TNF-"*alpha)) +
theme_classic() +
theme(legend.position = "top")
```

Within the statistics toolbox, we have commonly used tests like t-tests, ANOVA, correlations, and regression. These methods have their advantages, as they are relatively easy to use and understand. However, like any other tool, they also have their limitations. For instance, they struggle with scenarios involving variables with asymmetric distributions, non-linear relationships, unbalanced groups, heterogeneous variance, extreme values, or repeated measurements with loss of follow-up.

To address these limitations, non-parametric alternatives have been developed. These approaches offer flexibility but make it challenging to extrapolate inferences to new data due to the lack of distributional parameters. Other models, such as neural networks or random forest models, provide assistance when analyzing data with special properties. However, they often sacrifice simplicity and interpretability for increased flexibility and are commonly referred to as “black box” models.

Despite the availability of these alternative methods, there is still a pressing need to incorporate previous knowledge and align with the way human understanding is constructed. As humans, our perception of the world is shaped by experiences and prior beliefs. This is where Bayesian statistics come into play.

Bayesian statistics offer several advantages over classical statistics (also known as “frequentist”). Firstly, they provide a coherent framework for incorporating prior information into our analysis, enabling us to update our beliefs systematically. Additionally, Bayesian statistics allow us to quantify uncertainty through probability distributions, offering a more intuitive and interpretable way to express our findings and the degree of certainty.

Let’s consider the following example: Imagine we have a belief that when tossing a coin, there is a higher probability of it landing on heads. Our prior knowledge stems from a previous experiment where, out of 15 coin tosses, 10 resulted in heads. This implies an estimated probability of $10/15 \approx 0.67$ based on the previous data.

With this information in hand, we decide to explore further and conduct our own experiment. To our astonishment, out of 15 tosses, we observe an unexpected outcome: 13 tails and only 2 heads! This result suggests that the probability of getting heads based solely on our new data is a mere $2/15 \approx 0.13$. However, it would be unwise to dismiss the prior evidence in light of these conflicting results. Incorporating these findings into our body of knowledge becomes even more crucial as we strive to gain a deeper understanding of the combined effect.

To estimate the posterior probability of getting heads after tossing a coin, we can use the Bayesian framework. Let’s denote the probability of getting heads in a coin toss as $P(H)$.

According to the information provided, we have the prior probability $P(H)$ estimated from an independent experiment as 10 heads out of 15 tosses. This can be written as a Beta distribution:

$$
P(H) \sim \text{Beta}(10, 5)
$$

This symbol “$\sim$” means *distributed as*

Here, the Beta distribution parameters are (10, 5) since we had 10 heads and 5 tails in the prior experiment.

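Now, a new experiment with the same 15 tosses gives us 2 heads. To update our prior belief, we can use this information to calculate the posterior probability, which can be expressed as follows:

$$
P(H \mid \text{Data}) \propto P(\text{Data} \mid H) \times P(H)
$$

This symbol “$\propto$” means *proportional to*

Which is equivalent to saying:

$$
\text{Posterior} \propto \text{Likelihood} \times \text{Prior}
$$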

To calculate the posterior probability, we need to normalize the product of the likelihood and prior, which involves integrating over all possible values of $H$. However, in this case, we can use a shortcut because the prior distribution is conjugate to the binomial distribution, so the posterior distribution will also follow a Beta distribution:

$$
P(H \mid \text{Data}) \sim \text{Beta}(10 + 2, 5 + 13)
$$

About normalization

The product of the prior and the likelihood has the same shape as the final posterior probability distribution, as indicated by the “proportional to” symbol ($\propto$) in the previous equation. However, this raw product does not integrate to 1, so it is not yet a proper probability density function. To rectify this, the raw product needs to be normalized, which in most cases is done by integration or simulation.

After incorporating the data from the new experiment, the parameters of the Beta distribution become (12, 18) since we had 2 heads and 13 tails in the new experiment, meaning 12 heads and 18 tails in total.

About conjugacy

When we choose a Beta distribution as our prior belief and gather new data from a coin toss, an intriguing property emerges: the posterior distribution also follows a Beta distribution. This property, known as conjugacy, offers a valuable advantage by simplifying calculations. It acts as a mathematical shortcut that saves time and effort, making the analysis more efficient and streamlined.
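As a sanity check on the shortcut, here is a small sketch that normalizes the product of prior and likelihood by brute force with `integrate()` and compares it with the Beta(12, 18) density. The function names (`prior_dens`, `likelihood`, `unnormalized`, `posterior_dens`) are hypothetical, and the likelihood is assumed to be binomial with 2 heads in 15 tosses.

```r
# Brute-force normalization vs. the conjugate shortcut
prior_dens <- function(h) dbeta(h, 10, 5)                  # prior: Beta(10, 5)
likelihood <- function(h) dbinom(2, size = 15, prob = h)   # 2 heads in 15 tosses
unnormalized <- function(h) likelihood(h) * prior_dens(h)

# Normalizing constant: area under the raw product
norm_const <- integrate(unnormalized, lower = 0, upper = 1)$value
posterior_dens <- function(h) unnormalized(h) / norm_const

# Largest gap between the brute-force posterior and Beta(12, 18)
grid <- seq(0.05, 0.95, by = 0.05)
max(abs(posterior_dens(grid) - dbeta(grid, 12, 18)))
```

The gap should be negligible, which is the whole point of conjugacy: we get the same posterior without doing the integral ourselves.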

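To calculate the posterior probability of getting heads, we can consider the mode (maximum) of the Beta distribution, which is $\frac{\alpha - 1}{\alpha + \beta - 2}$:

$$
\frac{\alpha - 1}{\alpha + \beta - 2} = \frac{12 - 1}{12 + 18 - 2} = \frac{11}{28} \approx 0.39
$$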

Therefore, the posterior probability of getting heads is approximately 39% when we consider all the available evidence.
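The same number can be recovered in a couple of lines of R. This is a minimal sketch: the variable names are made up for the example, and the 95% interval is an extra summary added here for illustration, not part of the calculation above.

```r
# Posterior Beta parameters after combining prior and new data
a <- 10 + 2   # prior heads + new heads
b <- 5 + 13   # prior tails + new tails

# Closed-form mode of a Beta(a, b), valid when a > 1 and b > 1
post_mode <- (a - 1) / (a + b - 2)   # 11 / 28, about 0.393

# Numerical check: locate the maximum of the density directly
numeric_mode <- optimize(function(h) dbeta(h, a, b),
                         interval = c(0, 1), maximum = TRUE)$maximum

# A 95% equal-tailed credible interval from the Beta quantiles
cred_int <- qbeta(c(0.025, 0.975), a, b)

c(mode = post_mode, numeric_mode = numeric_mode)
cred_int
```

Reporting an interval alongside the mode is one way to express how uncertain we still are after 30 tosses in total.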

```
# Prior and Likelihood functions
data = function(x, to_log = FALSE) dbeta(x, 2, 13, log = to_log)
prior = function(x, to_log = FALSE) dbeta(x, 10, 5, log = to_log)
# Posterior
posterior = function(x) {
p_fun = function(i) {
# The operation is done on the log scale for numerical stability,
# minimizing rounding errors given the small magnitude of
# probability density values at each interval.
i_log = data(i, to_log = TRUE) + prior(i, to_log = TRUE)
# Then transformed back to get probabilities again
exp(i_log)
}
# Then we integrate using base function `integrate`
const = integrate(f = p_fun,
lower = 0L, upper = 1L,
subdivisions = 1e3L,
rel.tol = .Machine$double.eps)$value
p_fun(x) / const
}
## Plotting phase
### Color palette
col_pal <- c(Prior = "#DEEBF7",
Data = "#3182BD",
Posterior = "#9ECAE1")
### Main plotting code
ggplot() +
#### Main probability density functions
stat_function(aes(fill = "Data"), fun = data, geom = "density", alpha = 1/2) +
stat_function(aes(fill = "Prior"), fun = prior, geom = "density", alpha = 1/2) +
stat_function(aes(fill = "Posterior"), fun = posterior, geom = "density", alpha = 1/2) +
#### Minor aesthetics tweaks
labs(fill = "", y = "Density", x = "Probability of getting heads") +
scale_fill_manual(values = col_pal, aesthetics = "fill") +
scale_x_continuous(labels = scales::label_percent(),
limits = c(0,1)) +
scale_y_continuous(expand = c(0,0), limits = c(0, 6.5)) +
see::theme_modern() +
theme(legend.position = "top",
legend.spacing.x = unit(3, "mm")) +
#### Arrows
geom_curve(aes(x = .81, y = 4.1, xend = .69232, yend = 3.425), curvature = .4,
arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
geom_text(aes(x = .9, y = 4.1, label = "Beta(10,5)")) +
geom_curve(aes(x = .2, y = 5.9, xend = .07693, yend = 5.45), curvature = .4,
arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
geom_text(aes(x = .29, y = 5.85, label = "Beta(2,13)")) +
geom_curve(aes(x = .5, y = 5, xend = .3847, yend = 4.4), curvature = .4,
arrow = arrow(length = unit(1/3, "cm"), angle = 20)) +
geom_text(aes(x = .55, y = 5, label = "≈ 39%"))
```

This example truly showcases the power of Bayesian statistics, where our prior beliefs are transformed by new evidence, allowing us to gain deeper insights into the world. Despite the unexpected twists and turns, Bayesian inference empowers us to blend prior knowledge with fresh data, creating a rich tapestry of understanding. By embracing the spirit of Bayesian principles, we open doors to the exciting potential of statistics and embark on a captivating journey to unravel the complexities of our reality.

What’s even more fascinating is how closely the Bayesian inference process aligns with our natural way of learning and growing. Just like we integrate prior knowledge, weigh new evidence, and embrace uncertainty in our daily lives, Bayesian inference beautifully mirrors our innate cognitive processes. It’s a dynamic dance of assimilating information, refining our understanding, and embracing the inherent uncertainties of life. This remarkable synergy between Bayesian inference and our innate curiosity has been a driving force behind the rise and success of Bayesian statistics in both theory and practice.

As Gelman and Shalizi (2013) eloquently state, “A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics.”

However, it’s important to acknowledge that while advanced statistical tools offer incredible possibilities, they also come with their own set of limitations. To make the most of these tools, we need to understand their boundaries and make informed choices about their applications.

Imagine a time not too long ago when Bayesian statistics were not as prevalent as they are today. The computational challenges posed significant hurdles, limiting our ability to fully embrace their potential. But thanks to the rapid advancement of computing power and simulation techniques, the statistical landscape has undergone a revolution. We now find ourselves in an exciting era where complex Bayesian analysis is accessible to all. It’s like having a superpower in the palm of our hands—an empowering time where our statistical prowess can thrive and conquer new frontiers.

As passionate self-learners on this thrilling statistical journey, even without a formal statistician’s hat, we can’t help but feel an overwhelming excitement to share the vast potential of these tools for unraveling real-world phenomena. Delving into the world of statistics, especially through the lens of Bayesian inference, opens up a universe of captivating possibilities. By melding prior knowledge with fresh evidence and embracing the enigmatic realm of uncertainty, we can uncover profound insights into health, well-being, and the wondrous phenomena that shape our lives.

So, fellow adventurers, let’s ignite our curiosity, embrace our thirst for knowledge, and embark on this exhilarating voyage together. With statistics as our compass, we will navigate the complexities of our reality, expanding our understanding and seizing the extraordinary opportunities that await us.

Get ready to experience a world that’s more vivid, more nuanced, and more awe-inspiring than ever before. Together, let’s dive into the captivating realm of statistics, fueled by enthusiasm and a passion for discovery.

Gelman, Andrew, and Cosma Rohilla Shalizi. 2013. “Philosophy and the Practice of Bayesian Statistics.” *British Journal of Mathematical and Statistical Psychology* 66 (1): 8–38.

BibTeX citation:

```
@misc{castillo-aguilar2023,
author = {Castillo-Aguilar, Matías},
title = {Welcome to {Bayesically} {Speaking}},
date = {2023-06-10},
url = {https://bayesically-speaking.com/posts/2023-05-30 welcome/},
doi = {10.59350/35tc8-qyj10},
langid = {en}
}
```

For attribution, please cite this work as:

Castillo-Aguilar, Matías. 2023. “Welcome to Bayesically
Speaking.” June 10, 2023. https://doi.org/10.59350/35tc8-qyj10.