HMC by Hand: Gradients, Heartbeats, PDEs and Other Tales

We already saw in previous posts how to set up your own Hamiltonian Monte Carlo (HMC) sampler to explore the parameter space. However, there was something we didn’t cover in that or other posts: how to set up the gradient function that will guide the sampler through the parameter space. So buckle up! We’ll dive right into the nitty-gritty of math, gradients and partial derivatives by hand!

Tags: by-hand, mcmc, algorithms, non-linear, educational

Published: April 19, 2025
Introduction

When we exercise, our hearts speed up like they’re late for a date with a treadmill. This acceleration is reflected in the shrinking time between successive heartbeats, what the cool kids call the R‑R interval (RRi). If you’ve ever squinted at an electrocardiogram (ECG or EKG) trace, those dramatic vertical spikes are the R‑waves of each cardiac cycle. The time between two R‑waves gives us the RRi, usually measured in milliseconds. It’s deceptively simple, like a one‑ingredient recipe, yet it packs a punch as a real‑time indicator of both your heart rate and the tug‑of‑war between your sympathetic and parasympathetic nervous systems.

So, let’s connect the dots: during exercise, your sympathetic “fight‑or‑flight” side takes the wheel and the RRi shrinks (i.e., heart rate accelerates); during rest or recovery, the parasympathetic “chill‑ax” team rejoins, and the RRi lengthens (i.e., heart rate slows). This seesaw between autonomic branches crafts dynamic RRi patterns that reveal your physiological status and your cardio‑adaptability, like a Netflix algorithm that knows whether you’re into rom‑coms or true crime.

The RRi Dance: From Rest to Recovery

flowchart TD
  a((Rest)) --->| High RRi | b((Exercise))
  b --->| RRi decreases | c((Recovery))
  c -->| RRi increases | a

Picture it: at rest, your RRi is high, like a sloth binge‑watching its favorite show. Then you hit the gym and your RRi plummets faster than Bitcoin during a market crash. Once you stop, there’s a gradual recovery, reminiscent of that cliffhanger in a Marvel movie: slow build-up, dramatic pause, and finally resolution.

But calling it just “fast-to-slow” is like describing Game of Thrones as “a show with dragons”. The RRi curve is packed with hidden Easter eggs: baroreflex sensitivity, vagal rebound, sympathetic withdrawal, and more. To mine these gems, we need a model that actually understands the plot twist.

Enter the Castillo‑Aguilar et al. (2025) nonlinear RRi‑vs‑Time model. This beauty captures the composite shape of RRi dynamics across exercise and recovery, embedding asymmetric decay and rebound kinetics, plateau phases, and the “everyone’s heart is different” variability in curve smoothness.

Figure: Example exercise‑induced RRi‑vs‑time dynamics and key parameters governing the overall curve behavior.

Why Modeling RRi Dynamics Matters

Cardiovascular regulation is about as linear as a Christopher Nolan plot: it loops back on itself, throws in surprises, and keeps you guessing. A solid model of RRi behavior lets researchers and clinicians extract juicy parameters from data: the time constant of vagal reactivation (your parasympathetic chill‑out speed), the latency of sympathetic offset (how quickly you hit the brakes), or the asymptotic RRi level post‑exercise (your heart’s version of “and they lived happily ever after”). These metrics can be biomarkers of autonomic flexibility, cardiorespiratory fitness, or even frailty, because sometimes your heart is a delicate snowflake.

But modeling is only half the battle. Fitting a nonlinear model to real‑world RRi data is like reverse‑engineering Thanos’s gauntlet: given the snap‑tastic curve, which combination of infinity stones (parameters) did it take? The RRi response is notoriously idiosyncratic, highly subject‑specific, noisy, and often non‑stationary. You can fit the model to two different folks and get parameter sets as different as Luke and Darth Vader. Slight data noise and, boom, your fit is more unstable than a Jenga tower in an earthquake.

That’s when uncertainty quantification becomes your trusty sidekick.

A Matter of Guessing (But Doing It Right)

Think of modeling RRi dynamics as dial‑tuning a vintage synthesizer at a rave: decay rates, asymptotes, inflection points, all those knobs control the final beat drop (or curve). Sure, you might stumble on a combo that sounds (or fits) right, but are you sure it’s unique? Could a tweak in the opposite direction produce the same vibe? Which knobs actually matter, and which are just decorations?

Enter Bayesian parameter estimation, which doesn’t just pick the “best” knob settings; it surveys the entire festival grounds of plausible settings given the data. In Bayesian land, each combination is a point in a high‑dimensional rave; our goal is to map out the posterior distribution across that landscape.

Some regions are crowded (high probability), others are deserted wastelands (low probability). Characterizing these areas yields not only central estimates but also credibility intervals, parameter correlations, and maybe even multimodal party clusters. But exploring this terrain is like traversing Mordor: classical Markov Chain Monte Carlo (MCMC) methods such as Metropolis‑Hastings can take eons to converge, especially when dimensions spike or parameters gang up in strong correlations.

Cue the dramatic entry of Hamiltonian Monte Carlo (HMC).

Hamiltonian Monte Carlo: Beyond Random Walks

HMC is like giving your sampler a jetpack and a map: you don’t just wander aimlessly; you use gradient info (the slope of the log‑posterior, AKA “how steep is this hill”) to guide each leap. Instead of stumbling around blind (random‑walk MCMC), HMC surfs the posterior landscape, riding down slopes and coasting over plateaus.

Picture this: you’re Neo in The Matrix, seeing the curves of the probability surface, riding a particle through space with grace and style. The position of the particle is your set of model parameters (\(q\)), and momentum (\(p\)) is an auxiliary variable that’s basically your inertia, like that moment when you sprint downhill and can’t immediately stop.

Simulating this combo follows Hamiltonian mechanics, where the total energy (potential + kinetic) is conserved; think of it as your personal “always‑on” energy bar.

From Posterior to Physics: Hamilton’s Equations

The Hamiltonian encapsulates the system’s energy:

\[ H(q, p) = U(q) + K(p) \]

  • \(U(q)\) is the potential energy, here, the negative log posterior (so higher probability = lower potential).
  • \(K(p)\) is the kinetic energy, usually a quadratic form in momentum, like \(K(p) = \tfrac12 p^T M^{-1} p\).

The magic happens when you integrate Hamilton’s equations:

\[ \frac{dq}{dt} = \frac{\partial H}{\partial p}, \quad \frac{dp}{dt} = -\frac{\partial H}{\partial q} \]

These reversible, volume‑preserving dynamics let the particle glide through parameter space, conserving total energy, no crazy backtracking or random wanderings. HMC thus can make long, informed jumps without tanking acceptance rates, unlike our hapless Metropolis‑friends.
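To make “conserving total energy” concrete before we hit the real model, here’s a toy sketch in R that assumes nothing from this post: a 1-D standard normal target, where \(U(q) = q^2/2\) and \(\nabla U(q) = q\). A few steps of the leapfrog integrator (which we’ll meet properly later) barely budge the total energy.

Code
# Toy demo: simulate Hamiltonian dynamics on a 1-D standard normal.
# Potential U(q) = q^2 / 2, so the gradient is simply grad_U(q) = q.
U <- function(q) q^2 / 2
grad_U_toy <- function(q) q

q <- 1; p <- 0.5            # starting position and momentum
dt <- 0.1                   # integrator step size
H_start <- U(q) + p^2 / 2   # total energy at the start

for (i in 1:100) {
  p <- p - dt / 2 * grad_U_toy(q)  # half-step momentum update
  q <- q + dt * p                  # full-step position update
  p <- p - dt / 2 * grad_U_toy(q)  # half-step momentum update
}

c(H_start = H_start, H_end = U(q) + p^2 / 2)  # nearly identical

Run it: the start and end energies agree to a couple of decimals, which is exactly what lets HMC accept those long, informed jumps.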

But here’s the kicker: you need the gradient of the potential energy, i.e., the gradient of the negative log‑posterior (log‑likelihood plus log‑prior). Since we’re assuming flat priors (because today’s focus is the likelihood), the prior contribution drops out, flat function means zero slope:

\[ \nabla U(q) = - \left(\nabla \log L(q) + 0 \right) = - \nabla \log L(q). \]

So for our tutorial, we set \(\nabla U(q) = -\nabla \log L(q)\) and call it a day. (Yes, that’s the pedagogical shortcut. No judgment.)

The Gradient Grind: What Powers HMC

Probabilistic programming languages (Stan, PyMC, Turing.jl) automate gradient computation via automatic differentiation. It’s like having Iron Man’s suit calculate your every move. But manually deriving gradients is like opening up the arc reactor: you see which wires do what, you learn where the sparks fly, and you gain true respect for the machinery.

Our RRi‑vs‑Time model is a cocktail of sigmoids and exponentials. Each parameter (\(\alpha, \beta, c, \lambda, \phi, \tau, \delta\)) reshapes the curve in a nonlinear fashion. The log‑likelihood assembles residuals, differences between observed and predicted RRi, under a Gaussian error model. Deriving its gradient means wielding the chain rule, product rule, and the occasional identity crisis (“wait, am I differentiating with respect to \(\phi\) or existential dread?”).

Yes, it’s math‑heavy and notation‑dense, like deciphering ancient Sith scripts. But:

It’s worth it.

A Journey Ahead

In a previous post, we showed how to implement the RRi-vs-Time model in R using HMC. We defined priors, specified the likelihood, and let the sampler do its job. But we didn’t dwell on how the gradient was computed, nor on the intricacies of the target function with respect to each parameter.

That’s what this post is about.

We will derive the full gradient of the RRi-vs-Time likelihood function with respect to its parameters, manually. We will examine each component, apply calculus step-by-step, and connect the math back to the model’s physiological interpretation. It will be dense. But it will also be enlightening. This isn’t just a math exercise, it’s a way to sharpen your modeling intuition, improve your Bayesian inference skills, and gain deeper insight into cardiovascular dynamics.

Because once you’ve hand‑crafted those derivatives, you’ll know why HMC outruns vanilla MCMC, where your model is too fragile, and which parameters truly pull the strings. You’ll transform from a passive sampler‑passenger into the Hamiltonian driver of your own probabilistic saga.

So dust off your calculus cape and let’s dive in, one partial derivative, one leapfrog step at a time. May the gradients be ever in your favor.

Okay, let’s ditch the stuffy academic speak and inject some much-needed personality into this whole data-generating circus. We’re not just dealing with numbers here; we’re uncovering the hidden secrets of the universe… or at least, how this particular set of numbers decided to show up.

The Grand, Totally Spontaneous (Not Really) Data Generation Process

So, imagine our observed RRi signal (\(y_i\)), let’s call it the “Witness Data”. This witness data claims to come from a Gaussian distribution. Think of it like a slightly unreliable source who swears everything they do is perfectly normal and totally centered, with just a smidge of random chaos (\(\sigma^2\)) thrown in for plausible deniability. Mathematically, because we’re fancy like that, we write its confession as:

\[ y_i \sim \mathcal{N}(M_i, \sigma^2) \]

But here’s the kicker, the plot twist worthy of an M. Night Shyamalan film: the true signal, the one our witness data is supposed to be centered around (\(M_i\)), isn’t just some random dude off the street. Nope, it’s a carefully constructed masterpiece, a puppet controlled by our super-duper, top-secret model, \(f(t_i \mid \theta_k)\).

So, let \(M_i = f(t_i \mid \theta_k)\). This \(M\) is basically the puppet master, and its strings are pulled by a gang of parameters, \(\theta_k\). These parameters are like the secret sauce, the cheat codes, the actual reason things look the way they do. They are \(k \in \{\alpha, \beta, c, \lambda, \phi, \tau, \delta\}\), a diverse crew with questionable motives. And when they all work together (or plot against each other, who knows?), they produce the true signal according to this highly specific, slightly intimidating formula:

\[ f(t \mid \theta_k) = \alpha + \frac{\beta}{1 + e^{\lambda (t - \tau)}} - \frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}} \]

Yeah, looks complicated, right? Like a boss battle in a video game. But somebody’s gotta figure out what those \(\theta_k\) values are! We need to estimate the parameters of this whole “data generation process” so we can, you know, reproduce the observed data. It’s like trying to reverse-engineer a secret family recipe, essential for future culinary (or, in this case, statistical) endeavors.

Arriving at the Log Likelihood: Or, How We Stop Panicking About Tiny Numbers

Okay, remember that whole “Gaussian distribution” thing? Good, because that’s our starting point for figuring out just how likely our observed data is, given our model and its parameters. The Gaussian probability function is basically its ID card, its fingerprint. It looks like this, in all its slightly intimidating glory:

\[ \mathcal{N}_\text{pdf}(y \mid M, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(y - M)^2}{2\sigma^2}} \]

Now, the likelihood for a set of data points (\(y_i\)) is like asking for the probability of all those data points showing up, assuming they’re independent (because nobody likes clingy data points). You’d think we just multiply all those individual probabilities together, right? Like collecting all the Infinity Stones. And you’d be correct!

\[ L(y_i \mid M, \sigma^2) = \prod_{i = 1}^{N}{\frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(y_i - M)^2}{2\sigma^2}}} \]

BUT (and it’s a big “but”, like Sir Mix-a-Lot approved), multiplying a zillion tiny probabilities together gives you an even TINIER number. We’re talking smaller than your chances of winning the lottery while being struck by lightning on a Tuesday. This makes computers sad and prone to errors. So, like any good hacker, we use a trick: logarithms! Logarithms turn those scary multiplications into friendly additions, and they keep our numbers from disappearing into the computational abyss. It’s basically applying a cheat code for numerical stability. Hence, the logarithm of our likelihood looks like this after a bit of log-magic:

\[ \log L(y_i \mid M, \sigma^2) = \sum_{i = 1}^{N}{\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(e^{-\frac{(y_i - M)^2}{2\sigma^2}}\right)} \]

Now, let’s simplify this beast. The first term splits under the log of a quotient, \(\log(1/x) = \log(1) - \log(x)\), and the second term? Well, the logarithm and the exponential function (\(e\)) are mortal enemies who cancel each other out on sight, like two rival gang leaders. Think of it as \(\log(e^x) = x\), a classic takedown!

\[ \log L(y_i \mid M, \sigma^2) = \sum_{i = 1}^{N}{\log(1) - \log(\sqrt{2\pi\sigma^2}) - \frac{(y_i - M)^2}{2\sigma^2}} \]

Even better, \(\log(1)\) is a total slacker and equals zero. And that square root inside the log? We can use the power rule for logarithms to pull that \(1/2\) out front, like a stage manager pushing a performer into the spotlight. \(\log((2\pi\sigma^2)^\frac{1}{2})\) becomes \(\frac{1}{2}\log(2\pi\sigma^2)\). Boom!

\[ \log L(y_i \mid M, \sigma^2) = \sum_{i = 1}^{N}{-\frac{1}{2} \log(2\pi\sigma^2) - \frac{(y_i - M)^2}{2\sigma^2}} \]

Finally, the first term in the summation doesn’t even depend on \(i\) (the specific data point), so we can just multiply it by \(N\), the total number of data points, like collecting individual participation trophies and just giving everyone one big “You Showed Up” trophy. The summation only needs to hang around for the term that actually changes with each data point. This leaves us with the glorious, final form of our Gaussian log-likelihood function:

\[ \boxed{\log L(y_i \mid M, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}}} \]

And there you have it! We’ve wrestled the likelihood function, tamed it with logarithms, and forced it into this neat, much more manageable form. Now we can actually use this for things like, you know, estimating those sneaky \(\theta_k\) parameters we talked about earlier.
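Don’t take the underflow horror story on faith; here’s a minimal sketch (toy numbers, any Gaussian will do) showing the raw product dying while the log version shrugs:

Code
# Why logs? The raw product of many small densities underflows to zero,
# while the sum of log-densities stays a perfectly usable finite number.
set.seed(1)
y <- rnorm(1000, mean = 800, sd = 30)

prod(dnorm(y, mean = 800, sd = 30))             # 0 -- underflow!
sum(dnorm(y, mean = 800, sd = 30, log = TRUE))  # finite, around -4800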

Alright, buckle up, buttercups! We’ve got our fancy log-likelihood function, which is basically our map to finding the “best” parameters. But how do we actually climb this parameter mountain? We need the gradient, the mathematical equivalent of a Sherpa pointing you towards the steepest uphill path. Let’s figure out how this beast changes when we poke it with respect to its various components.

Estimating the Gradient of the Function: Or, Poking the Math Beast to See What Happens

With Respect to \(M\): The Mean, The Myth, The Model

First up, let’s see how our log-likelihood changes when we mess with \(M\). Remember, \(M\) is our model’s output, the suave operator pulling the strings, and it is itself controlled by the parameters \(\theta_k\). So what we’re really after is \(\frac{\partial \log L}{\partial \theta_k}\): “how does the log-likelihood react to \(M\)” chained with “how does \(M\) react to \(\theta_k\)”. Let’s take that partial derivative:

\[ \frac{\partial \log L}{\partial \theta_k} = \frac{\partial}{\partial \theta_k} \left( -\frac{N}{2} \log(2\pi\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

Look closely! That first term? \(-\frac{N}{2} \log(2\pi\sigma^2)\)? It’s got zero M-factor. It’s like a background extra who doesn’t speak any lines involving our star, \(M\). So, in the world of partial derivatives, it vanishes like my motivation on a Monday morning. Poof! Gone.

\[ \frac{\partial \log L}{\partial \theta_k} = \frac{\partial}{\partial \theta_k} \left( - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

Now, let’s tidy up. The \(2\sigma^2\) in the denominator is just chilling there, unaffected by \(M\). We can pull it out of the summation like a stagehand removing a prop that’s in the way.

\[ \frac{\partial \log L}{\partial \theta_k} = \frac{\partial}{\partial \theta_k} \left( - \frac{1}{2\sigma^2} \sum_{i = 1}^{N}{ (y_i - M)^2} \right) \]

Okay, time for the Chain Rule! This is like assembling a transformer: derive the outer layer (the square), then dive inside (the term with \(M\)). The exponent (2) comes down, we keep the inner term \((y_i - M)\), and then we multiply by the derivative of that inner term with respect to \(\theta_k\). The derivative of \((y_i - M)\) with respect to \(M\) is \(-1\), and \(M\) is itself a function of \(\theta_k\), so the inner derivative is \(-\frac{\partial M}{\partial \theta_k}\). It’s a derivative inception!

\[ \frac{\partial \log L}{\partial \theta_k} = - \frac{1}{2\sigma^2} \sum_{i = 1}^{N}{ (2)(y_i - M)\left(-\frac{\partial M}{\partial \theta_k}\right)} \]

Notice that \(M\) is indeed a function controlled by our parameter squad \(\theta_k\), so the \(\frac{\partial M}{\partial \theta_k}\) term is a placeholder for a future reckoning: we’ll fill in one concrete expression per parameter once we differentiate the model itself.

Let’s clean this up. The ‘2’ from the exponent comes out and cancels with the ‘2’ in the denominator (\(2 / (2\sigma^2) = 1/\sigma^2\)). And that negative sign from the derivative of \((y_i - M)\) with respect to \(M\) (which was \(-1\)) clashes with the negative sign out front, resulting in a dramatic, sign-cancelling explosion! They effectively high-five and disappear.

So, after the dust settles, we are left with this beauty, the gradient with respect to each parameter \(\theta_k\):

\[ \boxed{\frac{\partial \log L}{\partial \theta_k} = \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M) \cdot \frac{\partial M}{\partial \theta_k}}} \]

Bam! That tells us how much the log-likelihood score changes if we nudge a parameter \(\theta_k\) a tiny bit, mediated by how that nudge propagates through \(M\). It’s like knowing which way to push the puppet to get the desired stage effect.
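If you want to convince yourself the boxed formula isn’t lying, here’s a minimal sketch using the simplest possible “model”, \(M = \theta\) (so \(\partial M / \partial \theta = 1\)), checked against a finite difference; all numbers are made up:

Code
# Toy check: with M = theta, the boxed gradient collapses to
# sum(y - theta) / sigma^2. Compare against a central finite difference.
set.seed(7)
theta <- 5; sigma <- 2
y <- rnorm(50, mean = theta, sd = sigma)

logL <- function(th) sum(dnorm(y, mean = th, sd = sigma, log = TRUE))

analytic <- sum(y - theta) / sigma^2
h <- 1e-6
numeric <- (logL(theta + h) - logL(theta - h)) / (2 * h)

c(analytic = analytic, numeric = numeric)  # should agree to many decimals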

With Respect to \(\sigma\): The Chaos Coordinator

Now for the spicy one: \(\sigma\). This is the standard deviation, whose square (\(\sigma^2\), the variance) measures how much our data points are just vibing away from the true signal \(M\). How does our log-likelihood react when we tweak the level of chaos? Let’s revisit the log-likelihood expression we had before things got too wild:

\[ \frac{\partial \log L}{\partial \sigma} = \frac{\partial}{\partial \sigma} \left( -\frac{N}{2} \log(2\pi\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

Let’s attack the first term like it owes us money. \(-\frac{N}{2}\) is just a constant multiplier, so it hangs back. We need to pull the \(\sigma^2\) out of the log term. We can use the logarithm rule \(\log(ab) = \log(a) + \log(b)\). So, \(\log(2\pi\sigma^2) = \log(2\pi) + \log(\sigma^2)\). Don’t forget to distribute that \(-\frac{N}{2}\) like a bad virus!

\[ \frac{\partial \log L}{\partial \sigma} = \frac{\partial}{\partial \sigma} \left( -\frac{N}{2} \log(2\pi) -\frac{N}{2} \log(\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

See that first term now? \(-\frac{N}{2} \log(2\pi)\)? No \(\sigma^2\) in sight! It’s another extra, standing around looking awkward. Adios!

\[ \frac{\partial \log L}{\partial \sigma} = \frac{\partial}{\partial \sigma} \left( -\frac{N}{2} \log(\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

Now, let’s use the power rule on \(\log(\sigma^2)\). The ‘2’ hops out front, multiplying the \(-\frac{N}{2}\). Like a superhero team-up, the ’2’s cancel! \(-\frac{2N}{2} \log(\sigma) = -N \log(\sigma)\). Looking better!

\[ \frac{\partial \log L}{\partial \sigma} = \frac{\partial}{\partial \sigma} \left( -N \log(\sigma) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}} \right) \]

For the second term, let’s get it ready for its close-up. We can pull out constants and rewrite \(\frac{1}{\sigma^2}\) as \(\sigma^{-2}\) to make differentiation easier. And because writing \(\sum_{i = 1}^{N} (y_i - M)^2\) repeatedly is tedious, let’s call it \(S\) for “Sum of Squares of Deviations from the Mean” (catchy, right?).

\[ \frac{\partial \log L}{\partial \sigma} = \frac{\partial}{\partial \sigma} \left( -N \log(\sigma) - \frac{\sigma^{-2}}{2} S \right) \]

Alright, Showtime! Let’s derive. The derivative of \(-N \log(\sigma)\) with respect to \(\sigma\) is \(-N\) times the derivative of \(\log(\sigma)\), which is \(\frac{1}{\sigma}\). So, \(-\frac{N}{\sigma}\). Piece of cake! For the second term, \(\frac{\sigma^{-2}}{2} S\), we use the power rule again. The exponent \(-2\) comes down and multiplies \(\frac{1}{2}\), canceling them out (\((-2) \times \frac{1}{2} = -1\)). The exponent on \(\sigma\) decreases by 1, becoming \(\sigma^{-3}\). Don’t forget the \(S\) hanging around! This gives us \(-\sigma^{-3} S\).

Putting it all together, we get:

\[ \frac{\partial \log L}{\partial \sigma} = -\frac{N}{\sigma} + \sigma^{-3} S \]

This is because there was a negative sign in front of the summation term originally: \(-\sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}}\). When we rewrote it as \(-\frac{\sigma^{-2}}{2} S\), that negative sign carried over. So, the derivative of \(-\frac{\sigma^{-2}}{2} S\) is actually \(-(-2)\frac{\sigma^{-3}}{2} S = \sigma^{-3} S\).

Okay, now let’s substitute \(S\) back to its original verbose form \(\sum_{i=1}^{N}{(y_i - M)^2}\):

\[ \boxed{\frac{\partial \log L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N}{(y_i - M)^2}} \]

And there you have it! The gradient with respect to \(\sigma\). This tells us how the log-likelihood responds to changes in the variance, the level of ‘random’ noise in our data.
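Same paranoia, same cure: a quick hedged check of this \(\sigma\) gradient against a finite difference, on toy data with a fixed mean:

Code
# Sanity-check d(logL)/d(sigma) against a central finite difference.
set.seed(42)
M <- 800; sigma <- 25
y <- rnorm(200, mean = M, sd = sigma)
N <- length(y)

logL_sigma <- function(s) -N/2 * log(2 * pi * s^2) - sum((y - M)^2) / (2 * s^2)

analytic <- -N / sigma + sum((y - M)^2) / sigma^3
h <- 1e-5
numeric <- (logL_sigma(sigma + h) - logL_sigma(sigma - h)) / (2 * h)

c(analytic = analytic, numeric = numeric)  # should match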

These gradient functions? They’re our guides! We can plug in current parameter values, calculate the gradient, and it points us in the direction that increases our log-likelihood the fastest (or decreases the negative log-likelihood, same diff). This is crucial for fancy optimization tricks like Gradient Descent or the even cooler Hamiltonian Monte Carlo, which is like exploring the parameter space by pretending it’s a physical system (totally normal, right?).

Next up? We gotta get really specific and calculate the partial derivatives of that complex \(M\) function, the Castillo-Aguilar RRi-vs-time model itself, with respect to each of its seven parameters. It’s like breaking down the boss battle into figuring out each of the boss’s attack patterns.

Estimation of RRi-vs-Time Gradient

Alright, you magnificent data wranglers! We’ve conquered the log-likelihood mountain, figured out how it reacts to overall mean and variance changes. Now, it’s time to tackle the real boss battle: our specific model function, \(f(t \mid \theta_k)\). This is the engine, the secret formula, the source code that generates the signal we think we’re seeing.

Let’s gaze upon its intricate glory once more:

\[ f(t \mid \theta_k) = \alpha + \frac{\beta}{1 + e^{\lambda (t - \tau)}} - \frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}} \]

This isn’t just a function; it’s a carefully choreographed dance between seven mysterious parameters, \(\theta_k\). To understand how to change this dance to better match our data, we need its gradient. Think of the gradient \(\nabla f(t \mid \theta_k)\) as a team of specialists, each one telling us exactly how the function changes if only their assigned parameter is tweaked. It’s a column vector, a vertical lineup of partial derivatives, ready to report for duty:

\[ \nabla f(t \mid \theta_k) = \begin{bmatrix} \frac{\partial f(t \mid \theta_k)}{\partial \alpha} \\ \frac{\partial f(t \mid \theta_k)}{\partial \beta} \\ \frac{\partial f(t \mid \theta_k)}{\partial c} \\ \frac{\partial f(t \mid \theta_k)}{\partial \lambda} \\ \frac{\partial f(t \mid \theta_k)}{\partial \phi} \\ \frac{\partial f(t \mid \theta_k)}{\partial \tau} \\ \frac{\partial f(t \mid \theta_k)}{\partial \delta} \end{bmatrix} \]

To assemble this Voltron of derivatives, we gotta derive each piece separately. As they say in all the motivational posters, “Every journey begins with a single step”. Or in this case, “Every gradient vector begins with a single, possibly terrifying, partial derivative”. Let’s dive in!

Deriving the RRi-vs-Time Equations: One Parameter at a Time

Deriving the alpha (\(\alpha\)) parameter: The Chill Constant

Ah, \(\alpha\). The easiest member of the crew. \(\alpha\) just sits there, adding a constant offset to the whole function. It’s the baseline, the “minimum effort” part of the equation. Since it enters purely additively, not multiplied or exponentiated by anything else, its partial derivative is delightfully simple: \(f\) depends on \(\alpha\) only through the lone term \(\alpha\), and the derivative of \(\alpha\) with respect to \(\alpha\) is 1.

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \alpha} = 1} \]

Seriously, that’s it. Don’t overthink it. Enjoy this moment, it won’t last.

Deriving the beta (\(\beta\)) parameter: The Amplitude Multiplier

\(\beta\) is a bit more involved. It scales both the initial drop and the recovery components of our function. It’s like the volume knob for the main events. Since we’re deriving with respect to \(\beta\), anything not involving \(\beta\) (like \(\alpha\)) vanishes like a bad Wi-Fi signal. We treat the rest of the terms that are connected to \(\beta\) as constants for this operation.

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \beta} = \frac{1}{1 + e^{\lambda (t - \tau)}} - \frac{c}{1 + e^{\phi (t - \tau - \delta)}}} \]

Essentially, we just peeled \(\beta\) away, leaving the structures it was multiplying behind. Not too shabby.

Deriving the \(c\) parameter: The Recovery Attenuator

The parameter \(c\) is less ambitious; it only messes with the recovery part of the function. It scales how much of the \(\beta\)-driven recovery actually happens. Deriving with respect to \(c\) means the first part of the function (the initial drop) disappears, and we’re left with just the recovery term, with \(c\) stripped away, but keeping its sign and the \(\beta\) it was attached to.

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial c} = - \frac{\beta}{1 + e^{\phi (t - \tau - \delta)}}} \]

See? \(c\) was just multiplying that second fraction, so we just removed it. Like a bad habit.

Deriving the lambda (\(\lambda\)) parameter: The Exercise Rate Boss

Alright, the fun begins. \(\lambda\) lives inside the exponent of the first logistic term. This is where things get spicy and the Chain Rule comes knocking, demanding tribute. Deriving functions that look like \(\frac{A}{1+e^{g(x)}}\) is a classic move: by the quotient and chain rules, the derivative is \(-A \frac{g'(x)\, e^{g(x)}}{(1+e^{g(x)})^2}\). Our \(A\) is \(\beta\), and our \(g(x)\) is \(\lambda (t - \tau)\), with \(x\) being \(\lambda\). The derivative of \(\lambda (t - \tau)\) with respect to \(\lambda\) is simply \((t - \tau)\) (since \(t\) and \(\tau\) are treated as constants here).

Let’s apply the magic formula to the \(\beta\) term:

\[ \frac{\partial}{\partial \lambda} \left(\frac{\beta}{1 + e^{\lambda (t - \tau)}}\right) = -\beta \frac{\frac{\partial}{\partial \lambda}\left[\lambda (t - \tau)\right] \cdot e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} \]

The derivative of \(\lambda (t - \tau)\) with respect to \(\lambda\) is \((t-\tau)\). So:

\[ \frac{\partial f(t \mid \theta_k)}{\partial \lambda} = -\beta \frac{(t - \tau) e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} \]

And rearranging it slightly for maximum aesthetic appeal (or maybe just convention):

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \lambda} = -\beta (t - \tau) \frac{e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2}} \]

That looks intimidating, but it’s just chain rule and quotient rule doing a dance.

Deriving the phi (\(\phi\)) parameter: The Recovery Rate Controller

\(\phi\) is \(\lambda\)’s partner in crime, living inside the exponent of the second logistic term (the recovery part). The process is exactly the same as for \(\lambda\), just applied to the second term of the original function. Here, our constant multiplier is \(-c\beta\), and the exponent term is \(\phi (t - \tau - \delta)\). The derivative of \(\phi (t - \tau - \delta)\) with respect to \(\phi\) is \((t - \tau - \delta)\).

Applying the same derivative shortcut to the \(-c\beta\) term:

\[ \frac{\partial}{\partial \phi} \left(-\frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}}\right) = -(-c\beta) \frac{\frac{\partial}{\partial \phi}\left[\phi (t - \tau - \delta)\right] \cdot e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \]

The negative signs cancel (yay!), and the derivative of the exponent is \((t - \tau - \delta)\):

\[ \frac{\partial f(t \mid \theta_k)}{\partial \phi} = c\beta \frac{(t - \tau - \delta) e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \]

And arranged nicely:

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \phi} = c\beta(t - \tau - \delta) \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2}} \]

Looking good! Two rate parameters down.

Deriving the tau (\(\tau\)) parameter: The Time Shifter (Both Ways!)

Okay, \(\tau\) is a bit of a double agent. It appears in both logistic terms, shifting them along the time axis. This means we have to apply our derivation trick to both parts of the original function and then add the results. Remember the derivative of \(\lambda (t - \tau)\) with respect to \(\tau\) is \(-\lambda\) (because of the negative sign in front of \(\tau\)). Similarly, the derivative of \(\phi (t - \tau - \delta)\) with respect to \(\tau\) is \(-\phi\).

Applying the derivative rule to \(\frac{A}{g(x)}\) gives you \(-A \frac{g'(x)}{g(x)^2}\) (where \(A\) is \(\beta\) and \(g(x)\) is \(1 + e^{\lambda (t - \tau)}\)):

\[ \frac{\partial}{\partial \tau} \left(\frac{\beta}{1 + e^{\lambda (t - \tau)}}\right) = -\beta \frac{-\lambda e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} \]

The negative signs cancel, leaving: \(\beta \lambda \frac{e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2}\).

Now applying it to the second term (where the constant is \(-c\beta\) and exponent derivative is \(-\phi\)):

\[ \frac{\partial}{\partial \tau} \left(-\frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}}\right) = -(-c\beta) \frac{-\phi e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \]

Two negative signs cancel from the \(-(-c\beta)\), leaving \(c\beta\). The \(-\phi\) remains. So we get: \(-c\beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2}\).

Adding the results from both terms gives us the derivative with respect to \(\tau\):

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \tau} = \beta \lambda \frac{e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} - c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2}} \]

This one is a bit of a mouthful, reflecting \(\tau\)’s influence on both parts of the function.

Deriving the delta (\(\delta\)) parameter: The Recovery Lag

Finally, \(\delta\). This parameter only appears in the second logistic term, specifically inside the exponent, shifting the recovery phase. It’s the lag time, the “oof, I need a moment” parameter. The process is exactly like deriving with respect to \(\phi\), but now we’re deriving \(\phi (t - \tau - \delta)\) with respect to \(\delta\). The derivative of \(\phi (t - \tau - \delta)\) with respect to \(\delta\) is \(-\phi\).

Applying the derivative rule to the second term, with constant \(-c\beta\) and exponent derivative \(-\phi\):

\[ \frac{\partial}{\partial \delta} \left(-\frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}}\right) = -(-c\beta) \frac{-\phi e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \]

Again, two negative signs cancel (\(c\beta\)), leaving the \(-\phi\).

\[ \boxed{\frac{\partial f(t \mid \theta_k)}{\partial \delta} = -c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2}} \]

And there you have it! The derivative with respect to \(\delta\). Notice how similar it is to the \(\tau\) part of the recovery term derivative? They’re buddies in shifting that recovery curve.

We did it! We calculated every single partial derivative. Now we can assemble our magnificent gradient vector, the ultimate cheat sheet for navigating the parameter space of our model:

\[ \nabla f(t \mid \theta_k) = \begin{bmatrix} \frac{\partial f(t \mid \theta_k)}{\partial \alpha} \\ \frac{\partial f(t \mid \theta_k)}{\partial \beta} \\ \frac{\partial f(t \mid \theta_k)}{\partial c} \\ \frac{\partial f(t \mid \theta_k)}{\partial \lambda} \\ \frac{\partial f(t \mid \theta_k)}{\partial \phi} \\ \frac{\partial f(t \mid \theta_k)}{\partial \tau} \\ \frac{\partial f(t \mid \theta_k)}{\partial \delta} \end{bmatrix} = \begin{bmatrix} 1 \\ \frac{1}{1 + e^{\lambda (t - \tau)}} - \frac{c}{1 + e^{\phi (t - \tau - \delta)}} \\ - \frac{\beta}{1 + e^{\phi (t - \tau - \delta)}} \\ -\beta \frac{(t - \tau) e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} \\ c\beta(t - \tau - \delta) \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \\ \beta \lambda \frac{e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} - c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \\ -c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \end{bmatrix} \]

This magnificent… thing… is the full gradient of our RRi-vs-time model. Each component tells us how sensitive the model’s output is to a tiny change in its corresponding parameter at any given time \(t\). It’s the model’s instruction manual for improvement! Now we can use this gradient, combined with the log-likelihood gradient we found earlier, to explore the parameter space efficiently using methods like the aforementioned (and totally awesome) Hamiltonian Monte Carlo. The math beast has been poked, and it has shown us its secrets!

Expansion of the Gradient of \(\log L\)

Okay, team! We’ve got the individual pieces of the puzzle: how the log-likelihood changes when we mess with the mean (\(M\)) and variance (\(\sigma\)), and how our model (\(f(t \mid \theta_k)\)) changes when we mess with its parameters (\(\theta_k\)). Now, it’s time to assemble these pieces into the ultimate weapon: the full gradient of the log-likelihood function with respect to all the parameters we care about (the \(\theta_k\) gang and \(\sigma\)). This is the map that shows us the steepest path through our parameter landscape.

This gradient is the secret sauce, the flux capacitor, the whole reason we went through all that calculus pain. Why? Because it’s what fuels fancy optimization algorithms, particularly Hamiltonian Monte Carlo (HMC). In HMC, we pretend our parameters are tiny particles (\(q\), the “position” in parameter space) with momentum (\(p\)). The gradient of the log-likelihood (or technically, the negative log-likelihood, which HMC calls the “potential energy” \(U\)) is the “force” that updates this imaginary particle’s momentum. It’s like playing a very complicated game of physics in a multi-dimensional space, all to find the best parameter values.

To sound fancy and appease the ghost of Hamilton (the mathematician, not the musical!), we’ll package all of this as \(\nabla U(q)\), the gradient of the potential energy, which is just the negative gradient of the log-likelihood. Because \(q\) represents our parameters (our “coordinates” in the parameter space), and \(-\nabla U(q)\) is the force vector that will tell our parameter-particle where to go next. Physics! Woo!

Terminology Clarification: Don’t Cross the Streams!

Okay, quick reality check before the physics gets too wild. In HMC lore, the potential energy \(U(q)\) is usually defined as \(U(q) = -\log L(q) - \log \text{prior}(q)\). It’s the negative log-posterior probability, basically. We’re focusing on the gradient of the log-likelihood part, \(\nabla \log L(q)\), because, frankly, uniform priors are sometimes simpler and less scary for a first go. Just remember, if you decide to use non-uniform priors later (because you’re feeling brave or someone makes you), the actual force vector you need is \(\nabla U(q) = -\nabla \log L(q) - \nabla \log \text{prior}(q)\). Don’t say we didn’t warn you.

Let’s revisit the gradients we previously derived, the ones that told us how the log-likelihood changes with respect to the mean (\(M\)) and the variance (\(\sigma\)). Remember these champs?

The gradient with respect to each model parameter \(\theta_k\) (chained through \(M\)):

\[\boxed{\frac{\partial \log L}{\partial \theta_k} = \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M) \cdot \frac{\partial M}{\partial \theta_k}}}\]

And the gradient with respect to \(\sigma\):

\[\boxed{\frac{\partial \log L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N}{(y_i - M)^2}}\]

So, written compactly for a generic \(\theta_k\) plus \(\sigma\), our gradient vector for \(\nabla U(q) = -\nabla \log L(q)\) would look something like this, just slapping a negative sign in front because HMC uses the negative log-likelihood as potential energy (don’t ask, it’s a physics thing):

\[ \nabla U(q) = -\nabla \log L(q) = -\begin{bmatrix} \frac{\partial \log L}{\partial \theta_k} \\ \frac{\partial \log L}{\partial \sigma} \end{bmatrix} = -\begin{bmatrix} \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M) \cdot \frac{\partial M}{\partial \theta_k}} \\ -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N}{(y_i - M)^2} \end{bmatrix} \]

BUT WAIT! There’s a catch! The parameters we actually want to estimate are the \(\theta_k\) parameters inside \(M\), plus \(\sigma\). So, the gradient needs to have an entry for each of those parameters: \(\alpha, \beta, c, \lambda, \phi, \tau, \delta\), and \(\sigma\). We need to expand that generic \(\frac{\partial \log L}{\partial \theta_k}\) expression for each of the \(\theta_k\) parameters using the partial derivatives of our model \(M\) (or \(f(t \mid \theta_k)\), remember we agreed \(M\) is just shorthand for that fancy function).

Remember that giant vector \(\nabla f(t \mid \theta_k)\) we just slayed? Each element in that vector is \(\frac{\partial M}{\partial \theta_k}\) for a specific \(\theta_k\). We plug those into the formula for \(\frac{\partial \log L}{\partial \theta_k}\), one entry per parameter.

[Just a quick note for those keeping score: yes, \(M\) is \(f(t \mid \theta_k)\). So when you see \(\frac{\partial M}{\partial \theta_k}\) below, just picture that specific element from the \(\nabla f(t \mid \theta_k)\) vector we derived earlier. It’s like calling a character by their nickname vs. their full government name.]

Incorporating all those glorious partial derivatives of \(M\) into \(\frac{\partial \log L}{\partial \theta_k}\) for each \(\theta_k\), and adding the \(\sigma\) term at the end, the actual gradient \(\nabla U(q)\) that we feed into HMC looks like this magnificent beast:

\[ \nabla U(q) = -\nabla \log L(q) = -\begin{bmatrix} \frac{\partial \log L}{\partial \alpha} \\ \frac{\partial \log L}{\partial \beta} \\ \frac{\partial \log L}{\partial c} \\ \frac{\partial \log L}{\partial \lambda} \\ \frac{\partial \log L}{\partial \phi} \\ \frac{\partial \log L}{\partial \tau} \\ \frac{\partial \log L}{\partial \delta} \\ \frac{\partial \log L}{\partial \sigma} \end{bmatrix} = -\begin{bmatrix} \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \alpha}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \beta}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial c}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \lambda}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \phi}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \tau}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \delta}} \\ -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N}{(y_i - M_i)^2} \end{bmatrix} \]

This vector, my friends, is the crown jewel. It tells us, for any given set of parameter values (\(q\)), which way is “uphill” on the log-likelihood surface (or “downhill” on the potential energy surface), and how steep that slope is for each parameter. It’s our guiding star for exploring the parameter space and finding the values that best explain the data.

The calculus part is officially done! We’ve wrestled the derivatives, expanded the gradients, and emerged… well, maybe not unscathed, but victorious. The final boss of math is defeated. Now for the real challenge: translating these beautiful, hand-derived equations into functional R code that won’t just crash and burn. Wish us luck. We’re gonna need it.

Translating Math Into Code

Alright, we survived the calculus-induced fever dream! We’ve got equations that look like they were scribbled by a caffeinated wizard. But as any seasoned adventurer knows, having a map isn’t enough; you need to, you know, actually walk the path. And in the digital realm, “walking the path” means translating this beautiful, terrifying math into cold, hard code. Time to pull these fancy derivatives out of their ivory tower and force them to do something in R.

This “something” is giving our imaginary particle (the one that represents our parameters) directions. We’ll use the negative gradient of the log-likelihood (our \(-\nabla U(q)\)) as the force pushing this particle around the parameter space, hopefully towards areas of higher probability, carried by its glorious momentum \(p\). It’s like digital parkour across a probability landscape!

RRi-vs-Time Model: The Core Engine

First things first, we need the actual function that generates the “true” signal (\(M\)). This is the Castillo-Aguilar equation, the heart of our model. It takes time (\(t\)) and our list of parameters (\(\theta_k\)) and spits out the predicted RRi value at that time. This is crucial for calculating residuals later (because we need to know how far our predicted value is from the observed data point, like measuring how bad our model’s aim is).

Let’s remind ourselves of the star of the show:

\[ \boxed{f(t \mid \theta_k) = \alpha + \frac{\beta}{1 + e^{\lambda (t - \tau)}} - \frac{c\beta}{1 + e^{\phi (t - \tau - \delta)}}} \\ \text{where:}~k\in\{\alpha,\beta,c,\lambda,\phi,\tau,\delta\} \]

Where t will be our time vector (the input data) and params will be a named list containing our \(\theta_k\) values. Time to see this algebraic celebrity make its debut in R:

Code
# ==============================================
# 1. RRi Model (Castillo-Aguilar Equation)
# ==============================================
rr_model <- function(t, params) {
  alpha <- params$alpha
  beta <- params$beta
  c <- params$c
  lambda <- params$lambda
  phi <- params$phi
  tau <- params$tau
  delta <- params$delta
  
  term2 <- beta / (1 + exp(lambda * (t - tau)))
  term3 <- (-c * beta) / (1 + exp(phi * (t - tau - delta)))
  alpha + term2 + term3
}

Voila! A direct translation. Now, does it actually work? Let’s put it to the test by simulating some data that looks vaguely like the chaos seen during exercise and recovery, and then visualize it. This is where we see if our code is a masterpiece or a hot mess.

Code
# Plotting library (used for all figures below)
library(ggplot2)

# Time vector
t <- seq(0.01, 12, 0.01) 

# list with RRi parameters
params <- list(alpha = 800, beta = -300, c = 0.9,
               lambda = -3, phi = -2, 
               tau = 5, delta = 2)

# list with parameters for SD dynamics of RRi over time
params_sd <- list(alpha = 30, beta = -30, c = 0.7,
               lambda = -1, phi = -0.8, 
               tau = 5, delta = 3)

# Simulate the "true" (smooth) RRi curve
y_true <- rr_model(t, params)
# And the fluctuations for RRi SD over time
y_sd <- rr_model(t, params_sd)

# And simulate the observed RRi signal
y <- y_true + rnorm(n = length(t), sd = y_sd)

ggplot() +
  geom_line(aes(t, y), linewidth = 1/4, color = "purple") +
  geom_line(aes(t, y_true), linewidth = 1, color = "purple") +
  geom_line(aes(t, y_true + y_sd), linewidth = 1/2, linetype = 2, color = "purple") +
  geom_line(aes(t, y_true - y_sd), linewidth = 1/2, linetype = 2, color = "purple") +
  geom_vline(xintercept = c(5, 7), linetype = 2, color = "gray50") +
  annotate("text", label = "Rest", x = 3.5, y = 650, color = "gray50") +
  annotate("text", label = "Exercise", x = 6, y = 750, color = "gray50") +
  annotate("text", label = "Recovery", x = 8.5, y = 650, color = "gray50") +
  labs(y = "RRi (ms)", x = "Time (min)", 
       title = "Exercise-Induced RRi Dynamics",
       subtitle = "Cardiac Autonomic Modulation")

Look at that! Our model can indeed simulate those dramatic shifts in RRi during rest, exercise, and recovery. Even better, we can use a second instance of the same model (with different parameters) to simulate the changing standard deviation over time. See how the noisy purple line (the “observed” data) gets wider during the exercise phase and then narrows down? That’s the time-varying \(\sigma\) at work! It’s like simulating not just the main performance but also the chaotic energy of the crowd.

Log Likelihood Function: Scoring Our Parameter Guesses

Next up, the log-likelihood function. This is our scoring system. Given a set of parameter values, it tells us (on a log scale, because math) how probable our observed data is. Higher log-likelihood? Good parameters! Lower? Probably not the best fit. This function is key to knowing if our imaginary particle is climbing the right hill.

Let’s recall the equation we sweated over:

\[ \boxed{\log L(y_i \mid M, \sigma^2) = -\frac{N}{2} \log(2\pi\sigma^2) - \sum_{i = 1}^{N}{\frac{(y_i - M)^2}{2\sigma^2}}} \]

Now, let’s give this “bad boy” a proper R implementation:

Code
# ==============================================
# 2. Gaussian Log-Likelihood
# ==============================================
log_likelihood <- function(y, t, params, sigma) {
  M <- rr_model(t, params)
  residuals <- y - M
  N <- length(residuals)
  sigma_squared <- sigma^2
  
  -N/2 * log(2 * pi * sigma_squared) - sum(residuals^2) / (2 * sigma_squared)
}

Simple enough! It takes the observed data (y, t), the model parameters (params), and the crucial sigma value, and gives us back a single number: the log-likelihood score.
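A quick sniff test, reusing the y, t and params we simulated above (the exact numbers will vary with the random noise): the true parameters should out-score a deliberately botched set.

Code
# The true parameters should score higher than a perturbed set.
log_likelihood(y, t, params, sigma = 30)

params_off <- within(params, alpha <- alpha + 100)  # shift baseline by 100 ms
log_likelihood(y, t, params_off, sigma = 30)        # noticeably lower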

To really understand what this function does, let’s try computing the log-likelihood for a range of two parameters (alpha and c) while keeping the others fixed. This lets us visualize a small slice of the parameter space as a 3D surface.

Code
# Same parameters as before

# Observed signal
y <- y_true + rnorm(n = length(t), sd = 30)

# Input parameters
param_data <- expand.grid(
  alpha = seq(710, 890, length.out = 100),
  c = seq(0.3, 1.5, length.out = 100)
)

# Compute log likelihood
param_data$log_likelihood <- 
  apply(param_data, 1, function(x) {
    new_params <- within(params, {
      alpha <- x[["alpha"]];
      c <- x[["c"]]
    }) 
    log_likelihood(y, t, new_params, sigma = 20)
  })

library(data.table)  # for as.data.table(), dcast() and melt.data.table()
param_data <- as.data.table(param_data)


# Get unique alpha and c values for axes
z_vals <- as.matrix(dcast(param_data, alpha ~ c, value.var = "log_likelihood")[,-1])
alpha_vals <- unique(param_data$alpha)
c_vals <- unique(param_data$c)

# Create the 3D surface plot
library(plotly)

plot_ly() |> 
  add_trace(
    x = c_vals,
    y = alpha_vals,
    z = z_vals,
    type = "surface",
    colorscale = "Viridis",
    showscale = TRUE
  ) |> 
  layout(
    scene = list(
      xaxis = list(title = "C (Recovery proportion)"),
      yaxis = list(title = "Alpha parameter"),
      zaxis = list(title = "Log-Likelihood"),
      camera = list(eye = list(x = -1, y = -2.5, z = 0.5))
    ),
    title = "Log-Likelihood Landscape for alpha and c Parameters"
  )

Behold! A wild log-likelihood landscape appears! You can see hills and valleys. Optimization algorithms basically try to find the peak of the highest hill on this surface. This visualization is great for 2 parameters, but imagine doing this for all 8 parameters simultaneously. The number of points you’d need on a grid explodes faster than a poorly mixed science experiment.

I don’t have access to a quantum computer to crunch the numbers for the whole probability landscape without my computer bursting into flames

Exactly. Grid approximation for high-dimensional spaces? Not gonna happen unless you have hardware from the future. This is why gradient-based methods are essential, they use the slope (the gradient!) to figure out which way is up without having to check every single possible location.
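If you want the “not gonna happen” in numbers, here’s the back-of-the-envelope math for a grid over all 8 parameters:

Code
# Grid approximation cost for 8 parameters at a modest 100 points each
points_per_dim <- 100
n_params <- 8
n_evals <- points_per_dim^n_params  # 1e16 likelihood evaluations
n_evals

# Even just storing one 8-byte double per grid point:
n_evals * 8 / 1e15                  # ~80 petabytes of likelihood values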

Gradient of the RRi-vs-time model: The Model’s Sensitivity Report

Before we can calculate the full gradient of the log-likelihood, we need the gradients of the model itself with respect to each of its parameters. This tells us how sensitive the model’s output is to tiny changes in \(\alpha\), \(\beta\), \(\lambda\), etc., at any given moment in time.

Let’s recall that glorious (and slightly terrifying) vector of partial derivatives:

\[ \nabla f(t \mid \theta_k) = \begin{bmatrix} \frac{\partial f(t \mid \theta_k)}{\partial \alpha} \\ \frac{\partial f(t \mid \theta_k)}{\partial \beta} \\ \frac{\partial f(t \mid \theta_k)}{\partial c} \\ \frac{\partial f(t \mid \theta_k)}{\partial \lambda} \\ \frac{\partial f(t \mid \theta_k)}{\partial \phi} \\ \frac{\partial f(t \mid \theta_k)}{\partial \tau} \\ \frac{\partial f(t \mid \theta_k)}{\partial \delta} \end{bmatrix} = \begin{bmatrix} 1 \\ \frac{1}{1 + e^{\lambda (t - \tau)}} - \frac{c}{1 + e^{\phi (t - \tau - \delta)}} \\ - \frac{\beta}{1 + e^{\phi (t - \tau - \delta)}} \\ -\beta \frac{(t - \tau) e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} \\ c\beta(t - \tau - \delta) \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \\ \beta \lambda \frac{e^{\lambda (t - \tau)}}{(1 + e^{\lambda (t - \tau)})^2} - c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \\ -c \beta \phi \frac{e^{\phi (t - \tau - \delta)}}{(1 + e^{\phi (t - \tau - \delta)})^2} \end{bmatrix} \]

Yeah, it still looks like a lot. But trust me, implementing it in R is much less painful than deriving it by hand. We just translate each line of the vector into code.

Code
# ==============================================
# 3. Partial Derivatives of the RRi Model
# ==============================================
compute_partials <- function(t, params) {
  alpha <- params$alpha
  beta <- params$beta
  c <- params$c
  lambda <- params$lambda
  phi <- params$phi
  tau <- params$tau
  delta <- params$delta
  
  # Precompute common terms
  t_tau <- t - tau
  t_tau_delta <- t - tau - delta
  
  exp_lambda <- exp(lambda * t_tau)
  exp_phi <- exp(phi * t_tau_delta)
  denom1 <- (1 + exp_lambda)^2
  denom2 <- (1 + exp_phi)^2
  
  list(
    d_alpha = rep(1, length(t)),             # Is constant, so we repeat 1's
    d_beta = 1/(1 + exp_lambda) - c/(1 + exp_phi), # ∂M/∂β
    d_c = -beta / (1 + exp_phi),             # ∂M/∂c
    d_lambda = -beta * t_tau * exp_lambda / denom1, # ∂M/∂λ
    d_phi = c * beta * t_tau_delta * exp_phi / denom2, # ∂M/∂φ
    d_tau = beta * lambda * exp_lambda / denom1 - c * beta * phi * exp_phi / denom2, # ∂M/∂τ
    d_delta = -c * beta * phi * exp_phi / denom2 # ∂M/∂δ
  )
}

This function returns a list, where each element is a vector showing how the RRi changes over time for a unit change in that specific parameter. It’s like getting a separate sensitivity report for each control knob on our model.
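Before trusting these reports, a quick hedged spot check against central finite differences, reusing the params list from the simulation above; the differences should be vanishingly small.

Code
# Spot-check each analytic partial against a central finite difference
# at a single time point (t0 is an arbitrary choice).
h <- 1e-6
t0 <- 6

sapply(names(params), function(k) {
  up <- params; up[[k]] <- up[[k]] + h
  dn <- params; dn[[k]] <- dn[[k]] - h
  numeric  <- (rr_model(t0, up) - rr_model(t0, dn)) / (2 * h)
  analytic <- compute_partials(t0, params)[[paste0("d_", k)]]
  numeric - analytic  # should be ~0 for every parameter
})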

Let’s visualize these partial derivatives to see how a tiny nudge to each parameter affects the RRi curve across time:

Code
partials <- 
  compute_partials(t, params) |> 
  as.data.table()

names(partials) <- 
  c("alpha", "beta", "c", 
    "lambda", "phi", "tau", "delta")

data <- 
  cbind(time = t, partials) |> 
  melt.data.table(id.vars = "time")

ggplot(data, aes(time, value, col = variable)) +
  facet_grid(rows = vars(variable), scales = "free_y", 
             labeller = label_parsed) +
  geom_line(show.legend = FALSE, linewidth = 1) +
  geom_vline(xintercept = c(5, 7), linetype = 2, color = "gray50") +
  labs(x = "Time (min)", y = "Effect on RRi (ms)",
       title = "Effect of parameter change on RRi",
       subtitle = "Change in RRi for a unit change for each parameter")

This plot is super informative! It shows that changing \(\alpha\) just shifts the whole curve up or down (constant effect). Changing \(\beta\) has a big effect during exercise/recovery transitions. Changing the time parameters (\(\tau\), \(\delta\)) shifts where those transitions happen. The rate parameters (\(\lambda\), \(\phi\)) affect the speed of the transitions. This is the magic of partial derivatives, they disentangle the influence of each part of the system! It’s like getting an x-ray of our model’s inner workings.

Gradient of the Log Likelihood Function: The Master Gradient

Okay, the moment of truth! We combine the log-likelihood formula with the model’s partial derivatives to get the full gradient vector \(\nabla U(q)\). This is the vector of slopes for all our parameters (the 7 \(\theta_k\)’s and \(\sigma\)) on the log-likelihood surface. This is what tells our HMC particle where to move.

Remember this monster?

\[ \nabla U(q) = -\nabla \log L(q) = -\begin{bmatrix} \frac{\partial \log L}{\partial \alpha} \\ \frac{\partial \log L}{\partial \beta} \\ \frac{\partial \log L}{\partial c} \\ \frac{\partial \log L}{\partial \lambda} \\ \frac{\partial \log L}{\partial \phi} \\ \frac{\partial \log L}{\partial \tau} \\ \frac{\partial \log L}{\partial \delta} \\ \frac{\partial \log L}{\partial \sigma} \end{bmatrix} = -\begin{bmatrix} \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \alpha}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \beta}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial c}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \lambda}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \phi}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \tau}} \\ \frac{1}{\sigma^2} \sum_{i = 1}^{N}{ (y_i - M_i) \cdot \frac{\partial M_i}{\partial \delta}} \\ -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{N}{(y_i - M_i)^2} \end{bmatrix} \]

Implementing this in R means calling our rr_model to get \(M\), calculating the residuals \((y_i - M_i)\), calling our compute_partials function to get all the \(\frac{\partial M_i}{\partial \theta_k}\) terms, and then plugging everything into the formulas derived earlier. Don’t forget that negative sign out front because HMC is weird and uses potential energy (\(-\log L\))!

Code
# ==============================================
# 4. Gradient of Log-Likelihood (∇U(q))
# ==============================================
grad_U <- function(y, t, params, sigma) {
  M <- rr_model(t, params)
  residuals <- y - M
  partials <- compute_partials(t, params)
  N <- length(residuals)
  
  # Precompute inverses
  sigma_sq_inv <- 1 / sigma^2
  sigma_cu_inv <- 1 / sigma^3
  
  # Gradients for model parameters
  grad_params <- list(
    alpha = sum(residuals * partials$d_alpha) * sigma_sq_inv,
    beta = sum(residuals * partials$d_beta) * sigma_sq_inv,
    c = sum(residuals * partials$d_c) * sigma_sq_inv,
    lambda = sum(residuals * partials$d_lambda) * sigma_sq_inv,
    phi = sum(residuals * partials$d_phi) * sigma_sq_inv,
    tau = sum(residuals * partials$d_tau) * sigma_sq_inv,
    delta = sum(residuals * partials$d_delta) * sigma_sq_inv
  )
  
  # Gradient for sigma
  grad_sigma <- (-N / sigma) + sum(residuals^2) * sigma_cu_inv
  
  # Return -∇logL(q)
  gradLL <- c(grad_params, list(sigma = grad_sigma))
  
  lapply(gradLL, function(x) -x)
}

This function grad_U is the engine’s computer. It takes the current parameter values (our particle’s position \(q\)) and spits out the gradient vector (the force) at that exact location. This force vector is what we use to update the particle’s momentum, which then updates its position. It’s a continuous cycle of movement driven by the probability landscape’s slope.
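Before trusting this engine, it’s worth a quick smoke test. Below is a minimal sketch (my own addition, not part of the original pipeline) that compares grad_U against central finite differences of the potential energy \(U(q) = -\log L(q)\). The helper neg_log_lik is hypothetical, written from the Gaussian likelihood with additive constants dropped (dropping constants doesn’t change gradients):

Code
# Sanity check: analytic gradient vs. central finite differences.
# neg_log_lik is a hypothetical helper: the negative Gaussian log-likelihood
# with additive constants dropped.
neg_log_lik <- function(y, t, params, sigma) {
  res <- y - rr_model(t, params)
  length(y) * log(sigma) + sum(res^2) / (2 * sigma^2)
}

check_grad <- function(y, t, q, h = 1e-5) {
  # Analytic gradient from our hand-derived formulas
  analytic <- unlist(grad_U(y, t, q[names(q) != "sigma"], q$sigma))
  # Central finite differences, one parameter at a time
  numeric_grad <- vapply(names(q), function(nm) {
    q_hi <- q; q_hi[[nm]] <- q_hi[[nm]] + h
    q_lo <- q; q_lo[[nm]] <- q_lo[[nm]] - h
    (neg_log_lik(y, t, q_hi[names(q_hi) != "sigma"], q_hi$sigma) -
      neg_log_lik(y, t, q_lo[names(q_lo) != "sigma"], q_lo$sigma)) / (2 * h)
  }, numeric(1))
  cbind(analytic, numeric_grad)  # the two columns should agree closely
}

Running check_grad(y, t, c(params, list(sigma = sigma))) should show the two columns matching to several decimal places; if they don’t, the culprit is almost always a sign error or a missed chain rule in the hand-derived partials.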

Leapfrog Integrator: The Particle Simulator

Now that we have the force function (grad_U), we need an algorithm to simulate the particle’s movement through the parameter space using that force. This is where symplectic integrators like the leapfrog integrator come in. They are specifically designed to preserve the properties of Hamiltonian systems (like ours!) over time, making the simulation stable and efficient.

I’ve already bored you with the leapfrog integrator in previous blog posts, but a quick reminder never hurt anyone (much). It’s an initial kick followed by a three-step dance:

  1. Give the particle a random kick (initialize momentum \(p\), typically from a standard normal, one for each parameter in \(q\)): \[p \sim \mathcal{N}(0, I_n), \quad \text{where } n = \operatorname{length}(q)\] (Make sure your random momentum vector is the same size as your parameter vector \(q\), one \(p\) for every \(q\)!)

  2. Update the momentum halfway based on the current gradient (the force): \[p_{t+0.5} \leftarrow p_{t} - \nabla U(q_{t}) \cdot \frac{\text{stepsize}}{2}\]

  3. Update the position using this new, half-updated momentum: \[q_{t+1} \leftarrow q_{t} + p_{t+0.5} \cdot \text{stepsize}\]

  4. Update the momentum the other half of the way using the gradient at the new position: \[p_{t+1} \leftarrow p_{t+0.5} - \nabla U(q_{t+1}) \cdot \frac{\text{stepsize}}{2}\]

The three leapfrog sub-steps: first half-step momentum update, full-step position update, second half-step momentum update.

And then you just repeat steps 2-4 for a set number of “steps”. The complicated part wasn’t the algorithm; it was getting that \(\nabla U(q)\) function right!

Here’s the leapfrog algorithm compacted into a neat R function:

Code
leapfrog <- function(q, p, dt, y, t) {
  # Split the position vector into sigma and the model parameters
  sigma <- q$sigma
  params <- q[names(q) != "sigma"]
  
  # First half-step: update momentum using the gradient at the current position
  grad <- grad_U(y, t, params, sigma)
  p <- p - 0.5 * dt * unlist(grad)
  
  # Full step: update position using the half-updated momentum
  q_new <- as.list(unlist(q) + dt * p)
  
  # Second half-step: update momentum using the gradient at the new position
  new_sigma <- q_new$sigma
  new_params <- q_new[names(q_new) != "sigma"]
  grad_new <- grad_U(y, t, new_params, new_sigma)
  p <- p - 0.5 * dt * unlist(grad_new)
  
  list(q = q_new, p = p)
}

This function takes the current position (q), momentum (p), a small step size (dt), and the data (y, t), and returns the particle’s new position and momentum after one leapfrog step.

Now, let’s set up a simulation and watch our imaginary particle bounce around the parameter space!

Code
# Initial parameters ----
## Castillo-Aguilar's model
params <- list(
  alpha = 800, beta = -300, c = 0.8,
  lambda = -2, phi = -1, tau = 6, delta = 3
)
## Parameter for the standard deviation
sigma <- 10

# Data ----
## Time
t <- seq(0, 15, length.out = 1000)
## Observed RRi
y <- rr_model(t, params) + rnorm(1000, sd = sigma)

# Parameters for simulation ----
## Number of steps
steps <- 10000
## Stepsize at each step
stepsize <- 0.00025
## Random seed (set here, after the data draw above, so y itself will differ between runs)
set.seed(1234)

## Initiate random momentum vector
p <- rnorm(n = 8, mean = 0, sd = 1)

## Parameter
q <- c(params, list(sigma = sigma))
out <- vector("list", length = steps)

## Leapfrog integrator
for (step in 1:steps) {
  out[[step]] <- leapfrog(q = q, p = p, dt = stepsize, y = y, t = t)
  q <- out[[step]]$q
  p <- out[[step]]$p
}

We set up initial parameter values (q), give it a random initial kick (p), choose a tiny step size (stepsize), and let the leapfrog function run for many steps. Each step calculates the gradient, updates momentum, updates position, calculates the new gradient, and updates momentum again. It’s like a tiny, energetic dance!

To see if the simulation is behaving properly, we plot the phase space: momentum (\(p\)) vs. position (\(q\)) for each parameter over time. For a well-tuned HMC run, you expect to see cyclic behavior, like a pendulum swinging, indicating the particle is exploring the parameter space efficiently.

Code
position <- lapply(out, function(x) {
  as.data.table(x$q)
}) |> rbindlist()

momentum <- lapply(out, function(x) {
  as.list(x$p)
}) |> rbindlist()

position$step <- 1:steps
momentum$step <- 1:steps

position <- melt.data.table(position, id.vars = "step", value.name = "q")
momentum <- melt.data.table(momentum, id.vars = "step", value.name = "p")

sim_data <- position[momentum, on = c("step", "variable")]
sim_data[, `:=`(q = scale(q), p = scale(p)), list(variable)]

ggplot(sim_data, aes(p, q)) +
  facet_grid(rows = vars(variable), labeller = label_parsed) +
  geom_path(aes(group = 1, alpha = step, col = variable), show.legend = FALSE) +
  labs(x = "Momentum (p)", y = "Position (q)",
       title = "Phase Space of RRi Model",
       subtitle = "Particle Moving on the Parameter Space",
       caption = "Values are Standardized For Each Parameter")

Look at that beautiful chaos! For most parameters, you see that expected cyclic behavior, like a bunch of pendulums swinging. This means the particle is doing a decent job exploring those dimensions of the parameter space. However, notice the \(\beta\) parameter? It’s just drifting off into the abyss! It didn’t settle into that nice, contained, cyclic pattern. This is a clear sign that our simulation isn’t perfectly tuned.

Why does this happen? Could be anything! The stepsize might be too big (causing the particle to overshoot optimal regions), the number of steps (steps) might be too few, or the initial conditions might be bad. (Also remember that we’re simulating one long Hamiltonian trajectory here; a full HMC sampler periodically resamples the momentum and applies a Metropolis accept/reject step, which helps tame this kind of drift.) Debugging HMC simulations and tuning these “hyperparameters” (stepsize, trajectory length) is a whole art form in itself, a topic for yet another blog post! But for now, we’ve successfully translated our hairy math into functional R code that simulates a particle exploring a parameter space driven by the gradient of the log-likelihood. We’re basically playing god with tiny mathematical particles. And that’s pretty cool.
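One more diagnostic worth having in the toolbox: since leapfrog is (approximately) energy-preserving, the total Hamiltonian \(H(q, p) = U(q) + \frac{1}{2}\sum p^2\) should hover around a constant along the trajectory. Here’s a small sketch (again my own addition, reusing the hypothetical neg_log_lik helper from the gradient check above):

Code
# Track total energy H(q, p) = U(q) + K(p) along the simulated trajectory.
# U(q) reuses the (hypothetical) neg_log_lik helper defined earlier.
H <- vapply(out, function(s) {
  U <- neg_log_lik(y, t, s$q[names(s$q) != "sigma"], s$q$sigma)
  K <- 0.5 * sum(s$p^2)  # kinetic energy with an identity mass matrix
  U + K
}, numeric(1))

# A well-tuned stepsize keeps H oscillating tightly around a constant;
# steady drift or blow-up means the stepsize is too large for that region.
plot(H, type = "l", xlab = "Leapfrog step", ylab = "H(q, p)")

Tracking H costs one extra likelihood evaluation per stored step, and it often flags a bad stepsize long before the phase-space plots do.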

How I Learned to Stop Worrying and Love the Gradient

Alright, peel me off the floor. We made it! After a rollercoaster ride through the dark forests of calculus, the shimmering plains of probability, and the slightly-too-exciting physics playground of Hamiltonian Monte Carlo, we’ve arrived. This wasn’t just a stroll in the park; this was scaling Mount Doom with a spork. But hey, look at the view! (The view is a bunch of complex R functions, but trust me, it’s breathtaking for a stats nerd).

Looking back, there were moments, weren’t there? Like when we first saw that monster equation for \(f(t \mid \theta_k)\) and realized we had to derive its gradient by hand. A cold dread set in, the kind you feel when you realize you’ve promised to assemble IKEA furniture using only the picture on the box. Then came the coding. “Oh joy,” I thought, “translating this algebraic nightmare into something R won’t immediately reject like a bad organ transplant.” Turns out, the coding wasn’t quite as soul-crushing as the derivation itself – like finding out the IKEA instructions actually had some words, even if they were in Swedish. Still, it felt monumental.

But through that pain, clarity emerged. We ripped the black box wide open, Neo-style, dodging algebraic bullets to see the gears turning. Every line of derived math, every function in R, is a piece of the puzzle, showing how the elegant (lol) theory connects to the messy reality of heartbeats.

The real mind-bender? Hamiltonian Monte Carlo. It’s not just an algorithm; it’s a philosophy. It treats our parameters like a particle in a physics simulation, exploring the probability space not by stumbling around blindly like a random-walk zombie, but by using momentum to glide. Imagine Han Solo navigating an asteroid field, expertly using the force (or, in our case, the gradient of the log probability, which is just the negative gradient of the potential energy) to steer clear of boring low-probability zones and slingshot towards the juicy, high-probability ones. That gradient, \(\nabla \log P(\theta \mid \text{data})\), is our Millennium Falcon’s hyperdrive, guiding us through the statistical galaxy.

And those parameters in the Castillo-Aguilar model? They stopped being just Greek letters and started looking like character traits. \(\lambda\) and \(\phi\), the rate parameters, are the “get up and go” and “slow your roll” dials of your autonomic nervous system. \(\tau\) and \(\delta\) are the timing cues, the “start workout NOW” and “okay, now you can breathe” signals. Quantifying the uncertainty in these isn’t just academic busywork; it’s potentially how we build the next generation of fitness tech that doesn’t just guess, but knows your body’s recovery needs.

Make no mistake, this path was littered with the digital remains of failed derivative attempts. Manually deriving partials for that nested logistic function? It was like trying to defuse a bomb blindfolded. One sign error, one missed chain rule, and your whole simulation blows up. This is why tools with automatic differentiation are the superheroes of modern probabilistic programming. They have the precision of a brain surgeon’s laser scalpel compared to my rusty butter knife. But doing it by hand? That gives you the battle scars and the deep appreciation for why that laser scalpel is so necessary. You learn how changing \(\beta\), the exercise amplitude knob, doesn’t just affect one part of the likelihood; it sends ripples across the whole surface like dropping a bowling ball into a bowl of very sensitive, multi-dimensional jelly.
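If you want a taste of that laser scalpel without leaving base R, stats::deriv can symbolically differentiate simple expressions. Here’s a toy check on a logistic-style term, one ingredient of the model rather than the full thing:

Code
# Symbolic differentiation of a single logistic-style term with base R.
# A toy check, not the full Castillo-Aguilar model.
d_logis <- deriv(~ 1 / (1 + exp(-lambda * (x - tau))), c("lambda", "tau"))
lambda <- -2; tau <- 6; x <- c(0, 5, 10, 15)
attr(eval(d_logis), "gradient")  # partials w.r.t. lambda and tau at each x

Comparing output like this against your hand-derived partials, term by term, is a cheap way to catch sign errors before they snowball into an exploding simulation.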

And let’s not forget the leapfrog integrator, the trusty engine of our simulation. Getting its parameters right is the ultimate Goldilocks problem: the stepsize can’t be too big (it overshoots the peak like a drunk driver) nor too small (it takes forever, and your computer will stage a protest). You need it just right for that smooth, energy-preserving trajectory. Our \(\beta\) parameter, doing its wild, non-cyclic dance in the phase-space plot, was a stark reminder that sometimes the code is fine, but your tuning is off. Or maybe, just maybe, the data in that recovery phase isn’t giving parameters like \(c\), \(\phi\), and \(\delta\) enough information, letting them wander off together like a statistical crime syndicate.

Choosing the right model feels like deciding whether a bazooka or a Swiss Army knife is better for a zombie apocalypse. The Castillo-Aguilar model hits a sweet spot: complex enough to look like real RRi data, but simple enough that HMC doesn’t require a supercomputer and a prayer. Add more complexity (like trying to model every single heartbeat nuance), and your sampler starts begging for mercy. Bayesian model comparison is the judge in this eternal battle between “more detail!” and “please, can we go home now?”.

And priors! Oh, the priors. In this exercise, with flat priors, they were just background noise. But in the real world, priors from domain experts (like a cardiologist saying, “Yeah, \(\delta\) usually chills in this range”) are like Gandalf showing up just when you think you’re doomed. They guide your sampler away from nonsensical parameter values. Hierarchical models? That’s the whole fellowship, pooling information across individuals to make everyone’s parameter estimates better. HMC’s ability to handle these complex, correlated structures is why we endure the math – it unlocks deeper insights.

Zooming out, this whole pixelated journey underscores the power of modern probabilistic programming frameworks. Stan, PyMC, TFP – they automate the grunt work of differentiation so your brain can focus on the fun stuff: building cool models and interpreting results. But understanding what’s under the hood? Priceless. When your sampler throws a cryptic error, knowing your gradient equations turns you from a clueless bystander into a detective who can actually solve the mystery.

The most beautiful part, the true visual poetry, is in those phase-space plots. They are the Star Wars dogfights of statistics, showing the complex ballet between position and momentum. Interpreting them requires merging your statistical knowledge with a physicist’s intuition. Why is \(\beta\) spiraling? Maybe the data isn’t constraining it enough. Time for more data, or maybe a tighter prior.

Of course, bringing HMC models into the real world adds another layer of boss battles. 10,000 leapfrog steps? Easy peasy for research, a nightmare for real-time patient monitoring. This pushes us towards faster methods, trading off some Bayesian purity for speed. It’s a Faustian bargain, but sometimes necessary.

But the biggest payoff? The transformation in us. Rebuilding HMC from scratch is like learning to ride a unicycle on a tightrope made of spaghetti. It’s terrifying, you fall a lot, but when you finally get it, you understand balance and momentum in a way you never could just by watching. For anyone diving into this stuff, it turns HMC from a magical incantation into an open-book exam. You know why half-steps matter, why volume preservation is key, and why energy conservation keeps things sane.

Looking ahead, the possibilities are endless. Our simple RRi model could spawn sequels: hierarchical models for populations, drug interaction spin-offs, or even hybrid models blending our physiological structure with the pattern-recognition power of neural networks. But with great power comes great responsibility; using a misestimated \(\tau\) to guide someone’s health? Not funny. Ethical guardrails, clear reporting, and robust uncertainty quantification aren’t optional extras; they’re the non-negotiable pillars of this whole enterprise.

In the end, our HMC odyssey through the Castillo-Aguilar model is a microcosm of the scientific adventure itself. We’ve taken equations and turned them into a narrative of heartbeats, battling gradients and leapfrog steps like plot twists. And in the simple rhythm of RR intervals, we’ve found not just numbers, but echoes of biological truth, waiting for us to understand them, one Hamiltonian leap, one carefully coded function, one hard-won derivative at a time.

Citation

BibTeX citation:
@misc{castillo-aguilar2025,
  author = {Castillo-Aguilar, Matías},
  title = {HMC by {Hand:} {Gradients,} {Heartbeats,} {PDEs} and {Other}
    {Tales}},
  date = {2025-04-19},
  url = {https://bayesically-speaking.com/posts/2025-04-19-heartbeats-pdes-and-other-tales/},
  doi = {10.59350/450sp-tdq85},
  langid = {en}
}
For attribution, please cite this work as:
Castillo-Aguilar, Matías. 2025. “HMC by Hand: Gradients, Heartbeats, PDEs and Other Tales.” April 19, 2025. https://doi.org/10.59350/450sp-tdq85.