Gradients of Energy-Based Models

Energy-based models (EBMs) have drawn a lot of recent attention. Importantly, the gradient of the log-likelihood of an EBM with respect to its parameters can be written in a simple closed form. However, this gradient is commonly stated in papers without a derivation, so I thought I would derive it here.

Consider an energy-based model

\[ p_\theta(x) = \frac{1}{Z(\theta)} \, e^{-E_\theta(x)} \]

with normalizing constant $Z(\theta)$. Papers often state that the gradient of $\log p_\theta(x)$ with respect to $\theta$ is

\[ \frac{\partial}{\partial \theta} \log p_\theta(x) = \mathbb{E}_{p_\theta(x)} \left[ \frac{\partial}{\partial \theta} E_\theta(x) \right] - \frac{\partial}{\partial \theta} E_\theta(x). \]

But where does this come from? Here we derive it using the log-derivative trick and one key assumption. We start by writing out the gradient

\[ \frac{\partial}{\partial \theta} \log p_\theta(x) = \frac{\partial}{\partial \theta} \left[ -\log Z(\theta) - E_\theta(x) \right] = - \frac{\partial}{\partial \theta} \log Z(\theta) - \frac{\partial}{\partial \theta} E_\theta(x) \]

and notice that we have already identified the second term in the gradient. The first term requires some care. We start by using the log-derivative trick

\[ \frac{\partial}{\partial \theta} \log Z(\theta) = \frac{ \frac{\partial}{\partial \theta} Z(\theta)}{Z(\theta)}. \]

Next, we derive $\frac{\partial}{\partial \theta} Z(\theta)$ with the key assumption that we can interchange integration and differentiation

\[ \frac{\partial}{\partial \theta} Z(\theta) = \frac{\partial}{\partial \theta} \int e^{-E_\theta(x)} \mathrm{d}x = \int \frac{\partial}{\partial \theta} e^{-E_\theta(x)} \mathrm{d}x. \]

Putting together the pieces gives us

\[ \begin{align} \frac{\partial}{\partial \theta} \log Z(\theta) &= \frac{1}{Z(\theta)} \int \frac{\partial}{\partial \theta} e^{-E_\theta(x)} \mathrm{d}x \\ & = \int \frac{1}{Z(\theta)} \frac{\partial}{\partial \theta} e^{-E_\theta(x)} \mathrm{d}x \\ & = - \int \frac{1}{Z(\theta)} e^{-E_\theta(x)} \frac{\partial}{\partial \theta} E_\theta(x) \mathrm{d}x \\ & = - \mathbb{E}_{p_\theta(x)} \left[ \frac{\partial}{\partial \theta} E_\theta(x) \right]. \end{align} \]
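As a sanity check, we can verify the identity $\frac{\partial}{\partial \theta} \log Z(\theta) = -\mathbb{E}_{p_\theta(x)}[\frac{\partial}{\partial \theta} E_\theta(x)]$ numerically on a toy discrete EBM, where the integral becomes a finite sum. The quadratic energy and the small state space below are assumptions chosen purely for illustration:

```python
import numpy as np

# Toy EBM over a finite state space with a scalar parameter theta.
states = np.arange(5)

def energy(theta, x):
    return theta * x**2          # hypothetical E_theta(x); dE/dtheta = x**2

def log_Z(theta):
    # log Z(theta) = log sum_x exp(-E_theta(x)) on a discrete space
    return np.log(np.sum(np.exp(-energy(theta, states))))

theta, eps = 0.5, 1e-6

# Central finite-difference estimate of d/dtheta log Z(theta).
fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)

# -E_{p_theta}[dE/dtheta] computed exactly under the model distribution.
p = np.exp(-energy(theta, states))
p /= p.sum()
analytic = -np.sum(p * states**2)

print(np.allclose(fd, analytic, atol=1e-5))  # True
```

The two estimates agree up to finite-difference error, matching the derivation above.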

We are done! We can plug this into the equation above (keeping track of minus signs) to get

\[ \frac{\partial}{\partial \theta} \log p_\theta(x) = \mathbb{E}_{p_\theta(x)}\left[\frac{\partial}{\partial \theta} E_\theta(x) \right] - \frac{\partial}{\partial \theta} E_\theta(x). \]
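The full identity can also be checked end to end on a small discrete EBM, comparing a finite-difference gradient of $\log p_\theta(x)$ against the expectation formula. The quadratic energy, state space, and parameter values here are assumptions for illustration only:

```python
import numpy as np

# Toy EBM over states {0, 1, 2, 3} with a scalar parameter theta.
states = np.arange(4)

def energy(theta, x):
    return theta * x**2  # hypothetical energy E_theta(x)

def log_p(theta, x):
    # log p_theta(x) = -E_theta(x) - log Z(theta)
    logZ = np.log(np.sum(np.exp(-energy(theta, states))))
    return -energy(theta, x) - logZ

theta, x = 0.3, 2

# Left-hand side: finite-difference gradient of log p_theta(x).
eps = 1e-6
lhs = (log_p(theta + eps, x) - log_p(theta - eps, x)) / (2 * eps)

# Right-hand side: E_{p_theta}[dE/dtheta] - dE/dtheta,
# where dE/dtheta = x**2 for this energy.
p = np.exp(-energy(theta, states))
p /= p.sum()
rhs = np.sum(p * states**2) - x**2

print(np.allclose(lhs, rhs, atol=1e-5))  # True
```

In practice the expectation term is intractable for continuous $x$ and is estimated with samples (e.g. from MCMC), but on a finite state space we can compute it exactly and confirm the formula.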

David Zoltowski