KL Divergence

KL divergence is a measure of how one probability distribution differs from a second, reference probability distribution.
Categories: statistics, fundamentals

Published: August 27, 2025

Motivation and Definition

Let’s motivate the KL divergence in a simple way.

Suppose our beliefs about some quantity are encoded by a prior distribution \(P\). After seeing new evidence, we update to a distribution \(Q\). How far did our beliefs move? We seek a function \(I(P\to Q)\) that measures this information gain with the following desiderata:

  • Continuity: small changes in \(P\) or \(Q\) should produce small changes in \(I\).
  • Reparameterization invariance: relabeling outcomes or changing units must not change \(I\).
  • Non-negativity and identity: \(I(P\to Q) \ge 0\), with equality iff \(P = Q\).
  • Monotonicity: concentrating probability mass more strongly should increase \(I\) (e.g., narrowing a uniform prior over 24 candidates to 5 should yield a larger value than narrowing to 12).
  • Additive decomposition: for joint variables, \(I\) should decompose naturally into marginal and conditional contributions.

Under mild regularity conditions, these axioms characterize the Kullback–Leibler divergence uniquely up to a positive constant: the only functions satisfying all of the above properties are positive multiples of the KL divergence.

KL Divergence

Given two continuous probability distributions \(P\) and \(Q\) with densities \(P(x)\) and \(Q(x)\), the KL divergence is defined as:

\[ D_{KL}(P||Q) = \int_{-\infty}^{\infty} P(x) \log \frac{P(x)}{Q(x)} dx \tag{1}\]
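To make Equation 1 concrete, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the two normal densities and the grid are arbitrary illustrative choices) that approximates the integral with a Riemann sum:

```python
import numpy as np
from scipy.stats import norm

def kl_numeric(p_pdf, q_pdf, grid):
    """Approximate D_KL(P || Q) from Equation 1 by a Riemann sum on a uniform grid."""
    dx = grid[1] - grid[0]
    p, q = p_pdf(grid), q_pdf(grid)
    integrand = np.where(p > 0, p * np.log(p / q), 0.0)  # convention: 0 * log 0 = 0
    return float(np.sum(integrand) * dx)

# Illustrative choice: P = N(0, 1), Q = N(1, 2^2)
grid = np.linspace(-12, 12, 20_001)
print(kl_numeric(norm(0, 1).pdf, norm(1, 2).pdf, grid))  # ~0.443
```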

KL Divergence properties

Non-negativity

To prove that the KL divergence is always non-negative, i.e. \(D_{KL}(P||Q) \geq 0\), we need a mathematical tool called Jensen's inequality.

Jensen’s inequality

Jensen’s inequality states that for a convex function \(f\), the following inequality holds:

\[ f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)] \]
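As a quick sanity check of the concave form used in the proof below, \(\mathbb{E}[\log X] \leq \log \mathbb{E}[X]\), here is a small Monte Carlo sketch (the lognormal choice of \(X\) and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # any positive random variable works

# log is concave, so Jensen's inequality flips: E[log X] <= log E[X]
print(np.mean(np.log(x)), "<=", np.log(np.mean(x)))  # ~0.0 <= ~0.5
```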

Proof. Starting from the definition in Equation 1, we can write the KL divergence as an expectation under \(P\) and flip the ratio inside the logarithm: \[ D_{KL}(P||Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] = -\mathbb{E}_{x \sim P}\!\left[\log \frac{Q(x)}{P(x)}\right]. \]

Since \(\log\) is concave, Jensen's inequality gives \[ -\mathbb{E}_{P}\left[\log \frac{Q(x)}{P(x)}\right] \geq -\log \mathbb{E}_{P}\left[\frac{Q(x)}{P(x)}\right]. \]

The expectation now simplifies: \[ \log \mathbb{E}_{P}\!\left[\tfrac{Q(x)}{P(x)}\right] = \log \int P(x) \dfrac{Q(x)}{P(x)} dx = \log\int Q(x) dx = \log 1 = 0. \]

Therefore, \[ D_{KL}(P||Q) \;\geq\; -\log 1 = 0. \]

Equality holds iff \(P(x) = Q(x)\) almost everywhere.

To verify the non-negativity of the KL divergence empirically, you can vary the means and variances of two normal distributions and check the resulting KL divergence, as in the sketch below.
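This sketch uses the standard closed-form expression for the KL divergence between two univariate normals (a known result, stated here without derivation); the random parameter ranges and seed are arbitrary:

```python
import numpy as np

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)); standard result, stated without proof."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(0)
for _ in range(5):
    mu1, mu2 = rng.normal(size=2)
    s1, s2 = rng.uniform(0.5, 2.0, size=2)
    print(f"{kl_normal(mu1, s1, mu2, s2):.4f}")  # always >= 0
```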

Convexity

Note

The proof of convexity below follows The Book of Statistical Proofs.

The KL divergence is jointly convex in its arguments, i.e.

\[ \mathrm{KL}[\lambda p_1 + (1-\lambda) p_2||\lambda q_1 + (1-\lambda) q_2] \leq \lambda \mathrm{KL}[p_1||q_1] + (1-\lambda) \mathrm{KL}[p_2||q_2] \tag{2}\]

for any probability distributions \(p_1, p_2, q_1, q_2\) and any \(\lambda \in [0, 1]\).

Recall that a function \(f\) is convex if for any two points \(x_1\) and \(x_2\) in its domain,
\(f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2)\) for any \(\lambda \in [0, 1]\); Equation 2 is exactly this statement applied jointly to the pairs \((p_1, q_1)\) and \((p_2, q_2)\).

To prove this, we need the Log-sum inequality.

Log-sum inequality

For nonnegative numbers \(a_1, \dots, a_n\) and \(b_1, \dots, b_n\), the log-sum inequality states that:

\[\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^n a_i \right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}\]

Proof. Apply the log-sum inequality pointwise at each \(x\) with \(n = 2\), taking \(a_1 = \lambda p_1(x)\), \(a_2 = (1-\lambda) p_2(x)\), \(b_1 = \lambda q_1(x)\), and \(b_2 = (1-\lambda) q_2(x)\); the \(\lambda\) factors then cancel inside the logarithms:

\[\begin{aligned} \mathrm{KL}[\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2] &= \sum_{x \in \mathcal{X}} \Big[ (\lambda p_1(x) + (1-\lambda) p_2(x)) \log \tfrac{\lambda p_1(x) + (1-\lambda) p_2(x)}{\lambda q_1(x) + (1-\lambda) q_2(x)} \Big] \\ &\le \sum_{x \in \mathcal{X}} \Big[ \lambda p_1(x) \log \tfrac{\lambda p_1(x)}{\lambda q_1(x)} + (1-\lambda) p_2(x) \log \tfrac{(1-\lambda) p_2(x)}{(1-\lambda) q_2(x)} \Big] \\ &= \lambda \sum_{x \in \mathcal{X}} p_1(x) \log \tfrac{p_1(x)}{q_1(x)} + (1-\lambda) \sum_{x \in \mathcal{X}} p_2(x) \log \tfrac{p_2(x)}{q_2(x)} \\ &= \lambda \, \mathrm{KL}[p_1\|q_1] + (1-\lambda) \, \mathrm{KL}[p_2\|q_2] \end{aligned}\]

Convexity guarantees that projecting onto a convex family by minimizing KL admits a unique global solution, which is why I- and M-projections are well-behaved. It also underlies the stability of optimization routines that target KL—such as variational inference or iterative scaling—by ruling out spurious local minima within convex sets.

In practice, convexity is what justifies monotone‑improvement arguments for procedures like EM: each step can be shown to decrease an appropriate KL objective.
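A quick numerical check of Equation 2 on random discrete distributions (a sketch; the support size, mixing weight \(\lambda\), and seed are arbitrary, and `scipy.stats.entropy` with two arguments computes the discrete KL divergence):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) = sum_i p_i * log(p_i / q_i)

rng = np.random.default_rng(0)

def random_pmf(k):
    w = rng.random(k)
    return w / w.sum()

p1, p2, q1, q2 = (random_pmf(10) for _ in range(4))
lam = 0.3

lhs = entropy(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
rhs = lam * entropy(p1, q1) + (1 - lam) * entropy(p2, q2)
print(lhs <= rhs)  # True: the KL of the mixtures never exceeds the mixture of KLs
```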

Monotonicity

We now show that \(D_{KL}(P(x,y) \| Q(x,y)) \geq D_{KL}(P(y) \| Q(y))\); by symmetry, the same bound holds for the marginal in \(x\).

Intuition: “If you only partially observe random variables, it is harder to distinguish between two candidate distributions than if you observed all of them.”

\[ \begin{aligned} D_{KL}(P(x,y)\|Q(x,y)) &= \int P(x,y)\,\log \frac{P(x,y)}{Q(x,y)} \,dx\,dy \\ &= \int P(y)\,dy \int P(x|y)\,\log \frac{P(x|y)P(y)}{Q(x|y)Q(y)} \,dx \\ &= -\int P(y)\,dy \;\mathbb{E}_{x \sim P(x|y)}\!\left[\log \frac{Q(x|y)Q(y)}{P(x|y)P(y)}\right] \\ &\geq -\int P(y)\,dy \;\log \mathbb{E}_{x \sim P(x|y)}\!\left[\frac{Q(x|y)Q(y)}{P(x|y)P(y)}\right] \\ &= -\int P(y)\,\log\!\left(\frac{Q(y)}{P(y)} \int P(x|y)\frac{Q(x|y)}{P(x|y)}\,dx\right) dy \\ &= -\int P(y)\,\log\!\left(\frac{Q(y)}{P(y)} \cdot \underbrace{\int Q(x|y)\,dx}_{=1}\right) dy \\ &= -\int P(y)\,\log \frac{Q(y)}{P(y)}\,dy \\ &= D_{KL}(P(y)\|Q(y)). \end{aligned} \]
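A small numerical check of this monotonicity property on random discrete joint distributions (a sketch; the table shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_joint(nx, ny):
    w = rng.random((nx, ny))
    return w / w.sum()

P = random_joint(4, 3)  # P(x, y) as a table
Q = random_joint(4, 3)  # Q(x, y) as a table

kl_joint = np.sum(P * np.log(P / Q))
Py, Qy = P.sum(axis=0), Q.sum(axis=0)  # marginalize out x to get P(y), Q(y)
kl_marginal = np.sum(Py * np.log(Py / Qy))

print(kl_joint >= kl_marginal)  # True
```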

KL Divergence Generalization

KL divergence is a special case of the more general \(f\)-divergence, which is defined as:

\[ D_f(P||Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx \]

where \(f\) is a convex function with \(f(1) = 0\). In the case of the KL divergence, \(f(t) = t \log t\). Going into more detail about the \(f\)-divergence is beyond the scope of this note; interested readers might find this note useful.
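To see the correspondence concretely, here is a short sketch (on discrete distributions, with arbitrary random choices) checking that the \(f\)-divergence with \(f(t) = t \log t\) reproduces the KL divergence:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) = discrete KL(P || Q)

rng = np.random.default_rng(0)
p = rng.random(8); p /= p.sum()
q = rng.random(8); q /= q.sum()

def f_divergence(p, q, f):
    """Discrete analogue of D_f(P || Q) = sum_x q(x) * f(p(x) / q(x))."""
    return np.sum(q * f(p / q))

kl_via_f = f_divergence(p, q, lambda t: t * np.log(t))
print(np.isclose(kl_via_f, entropy(p, q)))  # True
```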

KL Divergence approximation

Questions

  • Show that for a parametric model family \(\{Q_\theta\}\) and data \(x_{1:n}\sim P\), maximizing the log-likelihood \(\sum_{i=1}^n \log Q_\theta(x_i)\) is equivalent to minimizing the forward KL \(\mathrm{KL}(\hat P_n\,\|\,Q_\theta)\), and in the population limit to minimizing \(\mathrm{KL}(P\,\|\,Q_\theta)\) up to an additive entropy constant.

  • Show that the KL divergence between two normal distributions with a common variance reduces to the squared difference of the means divided by twice that variance, i.e. \(D_{KL}(P||Q) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\), where \(P \sim N(\mu_1, \sigma^2)\) and \(Q \sim N(\mu_2, \sigma^2)\).
