The First Ruler of Distinguishability

Why Fisher information is the local geometry of statistical models

information geometry

fisher information

pullback metric

riemannian metric

This post introduces Fisher information as the first “ruler” for probability distributions: a local metric that measures how distinguishable nearby statistical models are from data. We will see why it arises naturally as a pullback metric, why Chentsov’s theorem makes it unique, and how it foreshadows the deeper connection between KL divergence, natural gradients, and learning.

Author

Gaurav Khanal

Published

June 5, 2026

1 The Question

Suppose we have two probability distributions, \(p\) and \(q\), defined over the same space. How far apart are they?

This is not a rhetorical question. Different answers lead to different algorithms, different criteria for what it means to learn well, and fundamentally different pictures of what the probability space looks like. The choice of distance is a geometric choice. And geometry, as we will see, has consequences that reach all the way into how neural networks are trained.

The naïve answer (subtract the distributions and take a norm) immediately runs into trouble. The \(L^2\) norm,

\[ \|p-q\|_{L^2} = \sqrt{ \int \left( p(x)-q(x) \right)^2\,dx} \tag{1}\]

is indifferent to where the points of the sample space \(X\) sit relative to each other. Consider a narrow Gaussian concentrated near pixel \(i\) of an image. Shifting it one pixel to the right produces a distribution that is nearly identical to the original in any intuitive sense. But when the distributions are narrow and have little overlap, the \(L^2\) norm can treat a small spatial shift almost as harshly as a much larger shift, because it sees pointwise mismatch rather than displacement cost.

This example is deliberately extreme. For broader distributions, \(L^2\) certainly responds to overlap. The deeper problem is not that \(L^2\) is always insensitive, but that it is not transport-aware: it penalizes pointwise differences regardless of whether probability mass moved a little or a lot. The norm sees only differences in probability mass at each location; it has no vocabulary for the cost of moving that mass through space. For most problems of interest, this is exactly the missing structure.
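The saturation described above can be checked numerically. The following is a minimal sketch, assuming two equal-width Gaussian bumps on a one-dimensional grid; the width `std`, the grid bounds, and the function names are illustrative choices, not part of the original argument.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def l2_distance(mean_p, mean_q, std, lo=-50.0, hi=50.0, n=20001):
    """Riemann-sum approximation of the L2 norm of p - q on [lo, hi]."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * dx
        diff = gaussian_pdf(x, mean_p, std) - gaussian_pdf(x, mean_q, std)
        total += diff * diff * dx
    return math.sqrt(total)

std = 0.1                                # narrow bumps (illustrative choice)
d_small = l2_distance(0.0, 1.0, std)     # shift by 1
d_large = l2_distance(0.0, 10.0, std)    # shift by 10
print(d_small, d_large)
```

Once the two bumps stop overlapping, the two distances agree to many digits even though one shift is ten times larger than the other.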

A better answer requires first deciding what structure the distance should respect: there are two natural answers and they come from completely different directions. In this post, we discuss the first one.

2 Distinguishability as Geometry

The first answer starts from a statistical question: how well can you tell two distributions apart from data?

If \(p\) and \(q\) are very close, distinguishing them requires many samples; if they differ substantially, even a small sample may suffice. This notion of statistical distinguishability is the basis of Fisher’s approach [1].

The key object is the score function, which is the gradient of the log-likelihood with respect to the parameter. For a parametric family \(p(x;\theta)\), the score at \(\theta\) is:

\[ s(x;\theta) = \frac{\partial}{\partial \theta} \log p(x;\theta). \tag{2}\]

The score measures, infinitesimally, how the distribution changes as you move in the parameter space. It is the velocity of the distribution in the direction of \(\theta\).
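As a concrete instance of Eq. (2), here is a small sketch for the Gaussian family with \(\theta=\mu\) and fixed \(\sigma\), where the score is \((x-\mu)/\sigma^2\). The function names and the numerical values are illustrative assumptions; the finite-difference check is only a sanity test.

```python
import math

def log_density(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu)**2 / (2.0 * sigma**2)

def score_mu_numeric(x, mu, sigma, eps=1e-5):
    """Central finite-difference derivative of the log-density in mu."""
    return (log_density(x, mu + eps, sigma) - log_density(x, mu - eps, sigma)) / (2.0 * eps)

def score_mu_exact(x, mu, sigma):
    """Closed-form score for the mean: (x - mu) / sigma^2."""
    return (x - mu) / sigma**2

x, mu, sigma = 1.3, 0.5, 2.0
numeric = score_mu_numeric(x, mu, sigma)
exact = score_mu_exact(x, mu, sigma)
print(numeric, exact)
```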

To quantify how large this velocity is—how much the distribution changes per unit movement in \(\theta\)—Fisher took the expected squared magnitude of the score. In 1-D, this gives the Fisher information [1, 2]:

\[ \mathcal I(\theta) = \mathbb E_\theta \left[ \left( \frac{\partial}{\partial \theta} \log p(x;\theta) \right)^2 \right]. \tag{3}\]

In higher dimensions, this becomes the Fisher Information Matrix (FIM):

\[ F_{ij}(\theta) = \mathbb E_\theta \left[ \partial_i \log p(x;\theta) \, \partial_j \log p(x;\theta) \right]. \tag{4}\]
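For the scalar case in Eq. (3), the expectation can be evaluated exactly when the sample space is finite. The sketch below assumes a Bernoulli model with success probability \(\theta\) (an illustrative choice) and compares the definition with the known closed form \(1/(\theta(1-\theta))\).

```python
def bernoulli_fisher(theta):
    """Fisher information of Bernoulli(theta) via the definition E[score^2]."""
    # Scores: d/dtheta log(theta) for x = 1, d/dtheta log(1 - theta) for x = 0.
    score_one = 1.0 / theta
    score_zero = -1.0 / (1.0 - theta)
    return theta * score_one**2 + (1.0 - theta) * score_zero**2

theta = 0.3
info = bernoulli_fisher(theta)
closed_form = 1.0 / (theta * (1.0 - theta))
print(info, closed_form)
```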

Notation

Here \(\mathbb E_\theta[\cdot]\) means expectation with respect to \(x\sim p(\cdot\,;\theta)\). The symbol \(\partial_i\) means differentiation with respect to the \(i\)th parameter coordinate, \(\partial/\partial\theta^i\). Later, when expressions such as \(F_{ij}(\theta)\delta^i\delta^j\) appear without an explicit summation sign, they use the standard repeated-index convention:

\[ F_{ij}(\theta)\delta^i\delta^j = \sum_{i,j}F_{ij}(\theta)\delta^i\delta^j. \]

Finally, \(T_\theta\mathcal M\) denotes the tangent space to the parameter manifold at \(\theta\): the vector space of infinitesimal parameter directions.

This matrix is not merely a useful statistic. It is a Riemannian metric tensor on the parameter manifold \(\mathcal{M}\): a smoothly varying inner product on each tangent space \(T_{\theta}\mathcal{M}\), encoding the infinitesimal statistical distinguishability of nearby distributions. The systematic study of this structure is the subject of information geometry [3, 4].

3 The FIM as a Pullback Metric

The FIM has a clean geometric origin and is not an ad hoc construction. It is the metric that parameter space inherits from the space of distributions via the parametric family. This is the concept of a pullback metric.

Mathematical Depth

Pullback metrics. Given a smooth map \(f:M \to N\) between manifolds and a Riemannian metric \(g\) on \(N\), the pullback metric \(f^{*}g\) on \(M\) is defined by:

\[ \left(f^{*}g \right)(u,v) = g_{f(p)}\left(df_p(u),df_p(v)\right) \tag{5}\]

where \(df_p: T_p M \to T_{f(p)} N\) is the differential of \(f\) at \(p\), and \(u,v \in T_pM\) are tangent vectors. Intuitively, to measure vectors in \(M\), push them forward to \(N\) using \(df\) and measure them using \(g\). For \(f^{*}g\) to be a genuine metric (positive definite), \(f\) must be an immersion, i.e., its differential must be injective at each point.

The FIM as a pullback. Let \(\mathcal{P}\) be the space of probability distributions, equipped with the \(L^2\) inner product on score functions: \(\langle u,v \rangle = \mathbb{E}_p[u \cdot v]\), where \(\mathbb E_p\) means expectation under the distribution \(p\).

In the general pullback notation above, the parameter manifold \(\mathcal M\) plays the role of \(M\), and the probability space \(\mathcal P\) plays the role of \(N\). The parametric family defines the model map

\[ f:\mathcal{M} \to \mathcal{P}, \quad \theta \mapsto p(\cdot\,;\theta). \tag{6}\]

Its differential sends parameter directions to score functions:

\[ df_\theta(\partial_i) = \partial_i \log p(\cdot\,;\theta). \tag{7}\]

This equation is the bridge between parameter geometry and distribution geometry. A coordinate direction \(\partial_i\) in parameter space becomes a score function on the sample space. The pullback of the \(L^2\) inner product is then

\[ \left(f^{*}g\right)_{ij} = \langle df(\partial_i),df(\partial_j)\rangle = \mathbb{E}_{\theta}\left[\partial_i \log p \cdot \partial_j \log p \right] = F_{ij}(\theta) \tag{8}\]

The FIM is the metric that \(\mathcal{M}\) inherits from \(\mathcal{P}\). Geometry flows downward from the space of distributions to the parameter space.
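On a finite sample space the pullback in Eq. (8) reduces to \(F_{ij}=\sum_x \partial_i p_x\,\partial_j p_x / p_x\), because \(\partial_i \log p_x = \partial_i p_x / p_x\). The sketch below assumes a hypothetical two-parameter categorical family on three outcomes, \(p(\theta)=(\theta_1,\theta_2,1-\theta_1-\theta_2)\), chosen only because its differential is a constant matrix.

```python
def model(theta):
    """Model map f: (t1, t2) -> categorical distribution on 3 outcomes."""
    t1, t2 = theta
    return [t1, t2, 1.0 - t1 - t2]

# Differential of f: column i holds d p / d theta_i (constant for this family).
JACOBIAN = [[1.0, 0.0],
            [0.0, 1.0],
            [-1.0, -1.0]]

def pullback_metric(theta):
    """F_ij = sum_x (d_i p_x)(d_j p_x) / p_x, the pulled-back inner product."""
    p = model(theta)
    return [[sum(JACOBIAN[x][i] * JACOBIAN[x][j] / p[x] for x in range(3))
             for j in range(2)] for i in range(2)]

theta = (0.2, 0.5)
F = pullback_metric(theta)
print(F)
```

Pushing the coordinate directions forward with the Jacobian and measuring them with the inner product on the simplex is all that happens here; no separate definition of the FIM is used.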

Coordinate invariance. The entries of the Fisher Information Matrix depend on the coordinates used to describe the parameter space. But the underlying geometric object does not.

Mathematically, the FIM is a \((0,2)\)-tensor: at each point \(\theta\), it takes two tangent vectors \(u,v \in T_\theta\mathcal M\) and returns a scalar,

\[ F_\theta(u,v) = \sum_{i,j} F_{ij}(\theta)u^i v^j. \tag{9}\]

In other words, the Fisher matrix is the coordinate representation of an intrinsic inner product on parameter directions. The matrix entries may change when we change coordinates, but the scalar quantity \(F_\theta(u,v)\) does not.

Under a reparameterization \(\tilde{\theta}=\phi(\theta)\), the FIM transforms as a \((0,2)\)-tensor:

\[ \tilde F_{kl} = \sum_{i,j} F_{ij} \frac{\partial \theta^i}{\partial \tilde\theta^k} \frac{\partial \theta^j}{\partial \tilde\theta^l}. \tag{10}\]

This is exactly the transformation law for a Riemannian metric. The FIM captures intrinsic geometry, not a coordinate artifact. Changing how we label the parameters changes the matrix entries, but not the statistical distance measured by the metric.
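The transformation law in Eq. (10) can be tested on a one-parameter example. The sketch below assumes a Bernoulli model described in two coordinates, the mean \(\theta\) and the logit \(\eta=\log(\theta/(1-\theta))\); both the model and the step size `u_eta` are illustrative assumptions.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fisher_theta(theta):
    """Fisher information of Bernoulli in the mean coordinate theta."""
    return 1.0 / (theta * (1.0 - theta))

def fisher_eta(eta):
    """Fisher information in the logit coordinate, via the tensor rule (10)."""
    theta = sigmoid(eta)
    dtheta_deta = theta * (1.0 - theta)
    return fisher_theta(theta) * dtheta_deta**2

eta = 0.7
theta = sigmoid(eta)
u_eta = 0.01                              # a small step in logit coordinates
u_theta = theta * (1.0 - theta) * u_eta   # the same tangent vector in theta coordinates

length_sq_eta = fisher_eta(eta) * u_eta**2
length_sq_theta = fisher_theta(theta) * u_theta**2
print(fisher_theta(theta), fisher_eta(eta), length_sq_theta, length_sq_eta)
```

The two matrix entries differ, but the squared length of the tangent vector is the same in both coordinate systems.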

4 Chentsov’s Theorem: The Unique Invariant Metric

The pullback construction shows that the FIM is natural. A deeper result, Chentsov’s theorem [5], shows it is unique.

Mathematical Depth

Chentsov’s Theorem.

Informal Version. The Fisher-Rao metric is the unique Riemannian metric on the statistical manifold, up to a positive scalar multiple, that is invariant under statistically sufficient transformations of the observations.

Equivalently: if a transformation of the data preserves all information relevant to \(\theta\), then it should preserve the geometry of the model. Fisher-Rao is the only Riemannian metric with this property.

A more formal version (finite-dimensional). Let

\[ \Delta_n = \left\{ p=(p_1,\ldots,p_n): p_i>0,\ \sum_{i=1}^n p_i=1 \right\} \tag{11}\]

be the interior of the probability simplex. This is the finite/discrete setting: points of \(\Delta_n\) are categorical probability distributions over \(n\) outcomes. Suppose that for each \(n\), we assign a Riemannian metric \(g^{(n)}\) on \(\Delta_n\). The assignment is called monotone under Markov maps if, for every stochastic map \(T:\Delta_n\to\Delta_m\) (equivalently, a Markov transition matrix),

\[ g^{(m)}_{Tp}(T_*u,T_*u) \leq g^{(n)}_p(u,u), \tag{12}\]

for every tangent vector \(u\in T_p\Delta_n\). In words: applying a noisy data-processing map cannot increase statistical distinguishability.

Here \(T_*u\) denotes the pushforward of the tangent vector \(u\) by the map \(T\): it is the infinitesimal direction obtained after applying the data-processing map.

Chentsov’s theorem says that, up to an overall positive constant, the only such monotone Riemannian metric is the Fisher-Rao metric:

\[ g_p(u,v) = c\sum_{i=1}^n \frac{u_i v_i}{p_i}, \qquad c>0. \tag{13}\]

Equivalently, in a parametric model \(p(x;\theta)\), this metric pulls back to the Fisher information matrix

\[ F_{ij}(\theta) = \mathbb E_\theta \left[ \partial_i\log p(x;\theta)\, \partial_j\log p(x;\theta) \right]. \tag{14}\]

Thus Fisher geometry is not merely invariant under reparameterizing \(\theta\); it is also the unique geometry that contracts under statistically noisy transformations of the observations.
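The monotonicity condition in Eq. (12) is easy to probe numerically. The sketch below assumes a hypothetical column-stochastic matrix `T` mapping three outcomes to two, and uses the fact that a linear map has itself as its pushforward. It checks one instance of the inequality and is not a proof.

```python
def fisher_rao(p, u, v):
    """Fisher-Rao inner product on the simplex: sum_i u_i v_i / p_i (c = 1)."""
    return sum(ui * vi / pi for ui, vi, pi in zip(u, v, p))

def apply_markov(T, vec):
    """Apply a column-stochastic matrix T (rows: outputs, columns: inputs)."""
    return [sum(T[k][i] * vec[i] for i in range(len(vec))) for k in range(len(T))]

# Hypothetical noisy channel from 3 outcomes to 2; each column sums to 1.
T = [[0.9, 0.5, 0.2],
     [0.1, 0.5, 0.8]]

p = [0.2, 0.3, 0.5]
u = [0.1, -0.3, 0.2]   # tangent vector: components sum to zero

before = fisher_rao(p, u, u)
Tp, Tu = apply_markov(T, p), apply_markov(T, u)
after = fisher_rao(Tp, Tu, Tu)
print(before, after)
```

The squared length shrinks after the noisy map, consistent with the statement that data processing cannot increase distinguishability.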

Chentsov is not saying the FIM is a natural choice among many. He is saying it is the only choice consistent with the principle that geometry should not depend on how data is represented. If two experimenters measure the same phenomenon but record their observations differently—and neither loses information—their geometric picture of the statistical model must agree. The FIM is the unique metric with this property.

This is a strong and somewhat surprising result. It says the geometry of statistical inference is not a convention or a convenience: it is forced by the structure of the problem.

This uniqueness result is local. The Fisher metric is an infinitesimal object: it is a tensor at each point of parameter space, giving an inner product on tangent vectors. It measures how fast nearby distributions diverge statistically, but it says nothing directly about distributions that are far apart.

Recovering a global distance requires minimizing path length over curves joining two distributions, that is, finding geodesics of the Fisher metric, which is generally difficult. For some parametric families (such as univariate Gaussians), geodesics can be computed explicitly. In general, though, they cannot. This locality is a strength (it gives a clean Riemannian theory that is well-defined and coordinate-free) and a genuine limitation (it is not a ready-made global distance between arbitrary distributions).

This is the first major contrast with Wasserstein geometry. Fisher gives a local ruler for statistical distinguishability; Wasserstein will give a global ruler for displacement through the sample space.

5 The Cramér-Rao Bound: Geometry in Action

The Fisher metric does more than organize the geometry of a statistical model. It places hard limits on what any learning or estimation procedure can achieve. Where the Fisher metric is small in some direction, a learner cannot acquire information in that direction quickly, no matter how its update rule is designed. The Cramér-Rao bound makes this precise [1, 2].

For an unbiased estimator \(\hat{\theta}\) of a scalar parameter \(\theta\), the bound states:

\[ \operatorname{Var}_{\theta}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)} \tag{15}\]

Here \(\operatorname{Var}_\theta\) denotes variance under \(p(\cdot\,;\theta)\).

In vector form, for an unbiased estimator \(\hat{\theta}\) of \(\theta \in \mathbb{R}^n\):

\[ \operatorname{Cov}_{\theta}(\hat{\theta}) \succeq F(\theta)^{-1} \tag{16}\]

where \(\operatorname{Cov}_\theta\) denotes covariance under \(p(\cdot\,;\theta)\), and \(\succeq\) denotes the Loewner partial order on positive semidefinite matrices: \(A\succeq B\) means \(A-B\) is positive semidefinite. No unbiased estimator can have covariance smaller than \(F(\theta)^{-1}\).
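A case where the scalar bound is attained can be computed exactly. The sketch below assumes \(n\) i.i.d. Bernoulli(\(\theta\)) observations and the sample mean as the estimator; Fisher information adds over independent samples, so the bound becomes \(1/(n\,\mathcal I(\theta))\). The values of `theta` and `n` are arbitrary.

```python
import math

def sample_mean_variance(theta, n):
    """Exact variance of the sample mean of n Bernoulli(theta) draws."""
    mean = 0.0
    second_moment = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)
        estimate = k / n
        mean += prob * estimate
        second_moment += prob * estimate**2
    return second_moment - mean**2

theta, n = 0.3, 20
fisher_per_sample = 1.0 / (theta * (1.0 - theta))
cramer_rao_bound = 1.0 / (n * fisher_per_sample)   # information adds over i.i.d. samples
variance = sample_mean_variance(theta, n)
print(variance, cramer_rao_bound)
```

Here the variance of the estimator equals the bound, so the sample mean is efficient for this model. Most estimators in most models sit strictly above it.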

For deep learning, this statement should not be read too literally. We are usually not searching for an unbiased estimator of a true parameter vector; neural networks are overparameterized, non-identifiable, and trained for predictive performance rather than classical parameter recovery. The relevance is geometric: the inverse Fisher describes how sensitive the model distribution is to parameter movement. In optimization language, it acts like a curvature-aware preconditioner, shaping stable step sizes and update directions.

The geometric reading is direct: \(F(\theta)^{-1}\) is the dual metric on the cotangent space, and the bound says that estimation uncertainty is bounded below by the inverse of the local metric scale of the statistical manifold. Where the Fisher metric is large in a direction—where distributions are highly distinguishable and change rapidly along that direction—estimation is easy. Conversely, where the local distinguishability is small, estimation is hard.

Mathematical Depth

Proof sketch. Fix an unbiased estimator \(\hat{\theta}\) of a scalar parameter \(\theta \in \mathbb{R}\). Unbiasedness means

\[ \mathbb{E}_{\theta}[\hat{\theta}(x)] = \int \hat{\theta}(x)p(x;\theta)\,dx = \theta. \tag{17}\]

Differentiate both sides with respect to \(\theta\):

\[ \frac{\partial}{\partial \theta} \int \hat{\theta}(x)p(x;\theta)\,dx = 1. \tag{18}\]

Exchanging derivative and integral gives

\[ \int \hat{\theta}(x) \frac{\partial}{\partial \theta}p(x;\theta)\,dx = 1. \tag{19}\]

Now use the score identity

\[ \frac{\partial}{\partial \theta}p(x;\theta) = p(x;\theta)\, \frac{\partial}{\partial \theta}\log p(x;\theta) = p(x;\theta)s(x;\theta). \tag{20}\]

Substituting this into the previous equation gives

\[ \mathbb{E}_{\theta}[\hat{\theta}(x)s(x;\theta)]=1 \tag{21}\]

where \(s=\partial_{\theta}\log p\) is the score. We also use Fisher’s identity:

\[ \mathbb{E}_{\theta}[s(x;\theta)] = \int \frac{\partial}{\partial \theta}p(x;\theta)\,dx = \frac{\partial}{\partial \theta}\int p(x;\theta)\,dx = 0. \tag{22}\]

Because the score has mean zero, the covariance between \(\hat{\theta}\) and \(s\) is

\[ \operatorname{Cov}_{\theta}(\hat{\theta},s) = \mathbb{E}_{\theta}[\hat{\theta}s] - \mathbb{E}_{\theta}[\hat{\theta}]\mathbb{E}_{\theta}[s] = 1. \tag{23}\]

Now apply the Cauchy-Schwarz inequality to this covariance:

\[ \operatorname{Cov}_{\theta}(\hat{\theta},s)^2 \leq \operatorname{Var}_{\theta}(\hat{\theta})\cdot\operatorname{Var}_{\theta}(s) \tag{24}\]

The left side is \(1\), and the variance of the score is the Fisher information:

\[ \operatorname{Var}_{\theta}(s) = \mathbb{E}_{\theta}[s^2] = \mathcal{I}(\theta). \tag{25}\]

Therefore,

\[ 1 \leq \operatorname{Var}_{\theta}(\hat{\theta})\mathcal{I}(\theta), \tag{26}\]

which is equivalent to

\[ \operatorname{Var}_{\theta}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)}. \tag{27}\]

The multivariate case follows the same logic, but covariance replaces variance and the score becomes a vector. Let

\[ s(x;\theta) = \nabla_\theta \log p(x;\theta). \tag{28}\]

For an unbiased vector estimator, differentiating \(\mathbb{E}_\theta[\hat{\theta}]=\theta\) gives

\[ \mathbb{E}_\theta[(\hat{\theta}-\theta)s^\top]=I. \tag{29}\]

This is the vector analogue of the scalar covariance identity above. Applying the matrix Cauchy-Schwarz inequality yields

\[ \operatorname{Cov}_{\theta}(\hat{\theta}) \succeq F(\theta)^{-1}. \tag{30}\]

One way to read this step is through a Schur-complement argument: the joint covariance block matrix of the estimator error and the score is positive semidefinite, and using \(\mathbb{E}_\theta[(\hat{\theta}-\theta)s^\top]=I\) forces the estimator covariance block to dominate the inverse Fisher block. This is exactly the matrix version of the scalar inequality above.
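Written out, the Schur-complement step looks as follows. This is a sketch under the assumption that \(F(\theta)\) is invertible; the symbol \(e\) for the estimator error is introduced here only for illustration. Since both \(e\) and \(s\) have mean zero, their joint covariance is

\[ e = \hat{\theta}-\theta, \qquad \operatorname{Cov}_\theta\begin{pmatrix} e \\ s \end{pmatrix} = \begin{pmatrix} \operatorname{Cov}_\theta(\hat{\theta}) & I \\ I & F(\theta) \end{pmatrix} \succeq 0. \]

A positive semidefinite block matrix with an invertible lower-right block has a positive semidefinite Schur complement, so

\[ \operatorname{Cov}_\theta(\hat{\theta}) - I\,F(\theta)^{-1}I \succeq 0 \quad\Longleftrightarrow\quad \operatorname{Cov}_\theta(\hat{\theta}) \succeq F(\theta)^{-1}. \]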

The Cramér-Rao bound is one of the oldest results in mathematical statistics, but its geometric interpretation as a statement about the inverse of a Riemannian metric was not fully articulated until Rao’s formulation [2] and the subsequent development of information geometry [3].

6 What the Fisher Metric Cannot See

The Fisher metric is defined entirely within parameter space. It requires a parametric family \(p(x;\theta)\) and measures how the model changes as \(\theta\) varies. It is sensitive to the statistical shape of the model—how distributions in the family differ from each other—and entirely indifferent to the geometry of the sample space \(X\) itself.

If you relabel the points of \(X\) arbitrarily (that is, apply any measurable bijection to the data), the Fisher metric does not change. This is exactly Chentsov’s invariance property that we touched on earlier, and it is both a strength and a limitation.

It is a strength because it means the geometry is intrinsic to the statistical model, not an artifact of data representation. It is a limitation because it means the Fisher metric has no vocabulary for the spatial layout of the data. To visualize the distinction, imagine a density plot: the horizontal axis is the sample space, while the vertical axis records how much probability mass sits at each location. Fisher sees changes in the heights of that plot, but not the geometry along the horizontal axis. It cannot tell you that the image of a cat shifted two pixels to the right is closer to the original than a completely different image. It lives entirely in the vertical direction, how probability weights change, and is blind to the horizontal direction of where points live in space.
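The blindness to the horizontal axis can be seen directly in the discrete setting. The sketch below assumes a categorical distribution on four points and an arbitrary relabeling of those points; the vectors `p`, `u` and the permutation are illustrative.

```python
import math

def fisher_rao(p, u, v):
    """Fisher-Rao inner product on the simplex: sum_i u_i v_i / p_i."""
    return sum(ui * vi / pi for ui, vi, pi in zip(u, v, p))

def relabel(vec, permutation):
    """Move the entry at position i to position permutation[i]."""
    out = [0.0] * len(vec)
    for i, target in enumerate(permutation):
        out[target] = vec[i]
    return out

p = [0.1, 0.2, 0.3, 0.4]
u = [0.05, -0.02, -0.04, 0.01]   # tangent vector: components sum to zero
permutation = [2, 0, 3, 1]       # arbitrary shuffle of the four sample points

original = fisher_rao(p, u, u)
shuffled = fisher_rao(relabel(p, permutation), relabel(u, permutation),
                      relabel(u, permutation))
print(original, shuffled)
```

Scrambling which outcome sits next to which leaves the Fisher length unchanged, whereas any transport-based distance would generally change.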

Consequence for Learning

Learning algorithms that minimize Kullback-Leibler (KL) divergence (maximum likelihood estimation, variational autoencoders, and most of classical statistical inference) are navigating Fisher geometry locally [3, 6]. This is not merely an analogy; a future post will make the connection precise. For now, the key point is that KL-based objectives are sensitive to the parametric structure of the model and indifferent to the spatial layout of the data.

A related failure appears for \(f\)-divergences such as Jensen-Shannon in early GANs: when supports are disjoint or nearly disjoint, the divergence can saturate and provide weak gradients, motivating Wasserstein-based objectives [7]. The failure traces directly to the geometry: locally, KL induces Fisher geometry, and Fisher geometry has no concept of spatial distance.

Natural gradient descent [8] corrects the optimization geometry by using \(F(\theta)^{-1}\) to rescale the gradient, moving in the steepest direction with respect to the Fisher metric rather than the Euclidean one. This is the geometrically correct update in parameter space—but it is still confined to the vertical, statistical geometry, which we will come back to in a future post.

7 Preview: Fisher as Local KL Geometry

The Fisher metric does not stand alone. It is the local, symmetric residue of a richer asymmetric object: the KL divergence. This connection—which we will come back to in another blog post—is worth previewing here because it explains why the Fisher metric appears so naturally in learning.

Mathematical Depth

Fisher metric as the Hessian of KL. Fix a parametric family \(p(\cdot\,;\theta)\) and define a local KL function by holding the second argument fixed:

\[ D_{\theta}(\delta) = \mathrm{KL} \left( p(\cdot\,;\theta+\delta) \,\Big\|\, p(\cdot\,;\theta) \right). \tag{31}\]

At \(\delta=0\), the two distributions agree, so

\[ D_{\theta}(0)=0. \tag{32}\]

The first derivative also vanishes. Intuitively, KL is minimized when the two distributions are the same, so there is no first-order change at \(\delta=0\):

\[ \left. \frac{\partial D_{\theta}(\delta)} {\partial \delta^i} \right|_{\delta=0} = 0. \tag{33}\]

The second derivative is the Fisher information matrix:

\[ \left. \frac{\partial^2 D_{\theta}(\delta)} {\partial \delta^i \partial \delta^j} \right|_{\delta=0} = F_{ij}(\theta). \tag{34}\]

Therefore, the Taylor expansion begins at second order:

\[ \mathrm{KL} \left( p(\cdot\,;\theta + \delta) \,\Big\|\, p(\cdot\,;\theta) \right) = \frac{1}{2}\,F_{ij}(\theta)\delta^i \delta^j + \mathcal{O}(\|\delta\|^3). \tag{35}\]

The same second-order term appears if the KL direction is reversed:

\[ \mathrm{KL} \left( p(\cdot\,;\theta) \,\Big\|\, p(\cdot\,;\theta + \delta) \right) = \frac{1}{2}\,F_{ij}(\theta)\delta^i \delta^j + \mathcal{O}(\|\delta\|^3). \tag{36}\]

This is why Fisher geometry is the local symmetric part of KL. The asymmetry of KL is real, but it does not show up in the quadratic approximation; it appears only at third order and beyond.

Here \(\mathcal{O}(\|\delta\|^3)\) means that the omitted terms are of cubic order or higher in the displacement size \(\|\delta\|\).
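The quadratic approximations in Eqs. (35) and (36) can be checked on a scalar model. The sketch below assumes a Bernoulli family with \(\mathcal I(\theta)=1/(\theta(1-\theta))\) and prints the ratio of each KL direction to \(\tfrac12\mathcal I(\theta)\delta^2\) for shrinking \(\delta\); the specific values are illustrative.

```python
import math

def kl_bernoulli(a, b):
    """KL( Bernoulli(a) || Bernoulli(b) )."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))

for delta in (0.1, 0.01, 0.001):
    forward = kl_bernoulli(theta + delta, theta)   # as in Eq. (35)
    reverse = kl_bernoulli(theta, theta + delta)   # as in Eq. (36)
    quadratic = 0.5 * fisher * delta**2
    print(delta, forward / quadratic, reverse / quadratic)
```

Both ratios approach 1 as \(\delta\) shrinks, while the gap between the two directions at larger \(\delta\) reflects the higher-order asymmetry.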

This means two things. First, any algorithm that minimizes KL divergence inherits Fisher geometry in its local second-order behavior. Minimizing KL gives Fisher as the local geometry, but the actual update rule may still use Euclidean gradients—this is precisely the gap that natural gradient descent, which will be discussed in future posts, is designed to close. Second, the Fisher metric inherits its symmetry from the Hessian operation: even though KL itself is asymmetric (because \(\mathrm{KL}(p\|q) \neq \mathrm{KL}(q\|p)\) in general), its second-order term is a symmetric bilinear form—the metric tensor \(F_{ij}\).

The asymmetry of KL beyond second order encodes something important: the two directions of KL correspond to qualitatively different behaviors (mode-covering vs. mode-seeking). I will examine this asymmetry in detail in a later post and show how it points directly toward the Schrödinger bridge.

Consequence for Learning

Many learning objectives compare distributions through KL divergence or likelihood. Locally, these objectives inherit the Fisher metric: directions in parameter space are not equally meaningful, because some directions change the model distribution much more than others.

Ordinary gradient descent ignores this geometry. It treats parameter space as Euclidean, so a step of the same coordinate size is treated as equally important in every direction. Natural gradient descent corrects this by using \(F(\theta)^{-1}\nabla L\), the steepest descent direction measured in distribution space rather than raw coordinates.
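A one-parameter toy problem makes the difference between the two update directions visible. The sketch below assumes a Bernoulli model fit by average negative log-likelihood, with a hypothetical data mean and a unit step size; it illustrates the preconditioning and is not a recipe for neural networks.

```python
def nll_gradient(theta, data_mean):
    """Gradient of the average Bernoulli negative log-likelihood in theta."""
    return -data_mean / theta + (1.0 - data_mean) / (1.0 - theta)

def fisher(theta):
    """Fisher information of Bernoulli(theta)."""
    return 1.0 / (theta * (1.0 - theta))

def natural_gradient(theta, data_mean):
    """Precondition the Euclidean gradient by the inverse Fisher information."""
    return nll_gradient(theta, data_mean) / fisher(theta)

data_mean = 0.8   # hypothetical fraction of ones in the data
theta = 0.05      # start near the boundary, where the raw gradient is large

euclidean_step = theta - 1.0 * nll_gradient(theta, data_mean)
natural_step = theta - 1.0 * natural_gradient(theta, data_mean)
print(euclidean_step, natural_step)
```

In this model the natural gradient simplifies to \(\theta - \bar x\), so a unit step lands on the maximum-likelihood estimate, while the unit Euclidean step leaves the interval \((0,1)\). That exactness is special to this toy case.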

For modern neural networks, however, the full FIM is usually infeasible to form explicitly: with \(N\) parameters, it is an \(N\times N\) matrix, so storage alone grows as \(N^2\). Practical methods therefore use approximations such as empirical Fisher estimates, matrix-free Fisher-vector products, diagonal approximations, low-rank approximations, or structured approximations such as K-FAC. These approximations are an active research area: they change the computation, but not the underlying geometric object.
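To illustrate what "matrix-free" means, the sketch below computes an empirical Fisher-vector product as \(\tfrac1n\sum_n s_n (s_n^\top v)\) without ever storing the matrix, and compares it with the explicit product. The per-example score vectors are made-up numbers, and the empirical Fisher (scores at observed data) is itself an approximation to the expectation in Eq. (4).

```python
def fisher_vector_product(scores, v):
    """Empirical Fisher times v, computed as mean_n s_n (s_n . v) without forming F."""
    dim = len(v)
    result = [0.0] * dim
    for s in scores:
        dot = sum(si * vi for si, vi in zip(s, v))
        for i in range(dim):
            result[i] += s[i] * dot / len(scores)
    return result

def explicit_fisher(scores):
    """Reference only: the full empirical Fisher matrix, mean_n s_n s_n^T."""
    dim = len(scores[0])
    return [[sum(s[i] * s[j] for s in scores) / len(scores) for j in range(dim)]
            for i in range(dim)]

# Hypothetical per-example score vectors (gradients of the log-likelihood).
scores = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0], [2.0, 0.0, -1.0]]
v = [1.0, 1.0, 1.0]

fvp = fisher_vector_product(scores, v)
F = explicit_fisher(scores)
reference = [sum(F[i][j] * v[j] for j in range(3)) for i in range(3)]
print(fvp, reference)
```

The matrix-free version needs memory proportional to \(N\) rather than \(N^2\), which is the reason such products are attractive at scale.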

This is the first practical lesson of Fisher geometry: learning is not just about the slope of the loss, but about the geometry of how parameters change distributions.

8 The Fisher Metric in the Exponential Family

The previous section showed that KL locally contains Fisher geometry in its second-order term. Exponential families are the cleanest setting where this local picture becomes algebraic: the Fisher metric, KL divergence, and convex duality can all be written in terms of a single function, the log-partition function.

Exponential families include the Gaussian, Bernoulli, and Poisson distributions, along with many others used in practice.

An exponential family has density

\[ p(x;\theta) = h(x)\exp\,\{\theta^{\top}\,T(x)-A(\theta)\} \tag{37}\]

where \(T(x)\) are the sufficient statistics, \(\theta\) are the natural parameters, and \(A(\theta)\) is the log-partition function. A direct calculation shows that the FIM is the Hessian of \(A\):

\[ F_{ij} = \frac{\partial^2\,A}{\partial\theta^i\,\partial\theta^j}. \tag{38}\]

Mathematical Depth

Derivation. For the exponential family, the log-density is

\[ \log p(x;\theta) = \theta^{\top}T(x)-A(\theta)+\log h(x). \tag{39}\]

Differentiating with respect to the natural parameter \(\theta^i\) gives the score component

\[ \partial_i \log p(x;\theta) = T_i(x)-\partial_i A(\theta). \tag{40}\]

The score has mean zero by Fisher’s identity:

\[ \mathbb{E}_{\theta}\left[\partial_i \log p(x;\theta)\right] = 0. \tag{41}\]

Therefore,

\[ F_{ij} = \mathbb{E}_{\theta} \left[ \left(T_i-\partial_i A(\theta)\right) \left(T_j-\partial_j A(\theta)\right) \right] = \operatorname{Cov}_{\theta}(T_i,T_j) = \partial_i\partial_j A(\theta) \tag{42}\]

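The FIM is the covariance of the sufficient statistics, or equivalently, the Hessian of the log-partition function with respect to the natural parameters \(\theta\). Since \(A\) is convex, \(F\) is positive semidefinite; when the family is minimal (the sufficient statistics are not affinely dependent), \(A\) is strictly convex and \(F\) is positive definite, confirming that the FIM is a valid Riemannian metric.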
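The chain of equalities in Eq. (42) can be verified on the Poisson family, where \(T(x)=x\), \(\theta=\log\lambda\), and \(A(\theta)=e^{\theta}\). The sketch below is a numerical check under those standard identifications; the truncation point `max_count` and the step `eps` are illustrative choices.

```python
import math

def log_partition(theta):
    """Log-partition function of the Poisson family: A(theta) = exp(theta)."""
    return math.exp(theta)

def hessian_of_A(theta, eps=1e-4):
    """Central second difference of A."""
    return (log_partition(theta + eps) - 2.0 * log_partition(theta)
            + log_partition(theta - eps)) / eps**2

def variance_of_T(theta, max_count=200):
    """Variance of T(x) = x under Poisson(rate = exp(theta)), by truncated summation."""
    rate = math.exp(theta)
    pmf = math.exp(-rate)          # P(x = 0)
    mean = 0.0
    second_moment = 0.0
    for x in range(max_count):
        mean += pmf * x
        second_moment += pmf * x * x
        pmf *= rate / (x + 1)      # P(x + 1) from P(x)
    return second_moment - mean**2

theta = math.log(3.0)              # rate 3
hess = hessian_of_A(theta)
var_T = variance_of_T(theta)
print(hess, var_T)
```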

Example: Gaussians. The Gaussian \(\mathcal{N}(\mu,\sigma^2)\) belongs to an exponential family with natural parameters

\[ \theta_1=\frac{\mu}{\sigma^2}, \qquad \theta_2=-\frac{1}{2\sigma^2}. \tag{43}\]

In those coordinates,

\[ F_{\theta} = \nabla^2 A(\theta). \tag{44}\]

To express the same metric in the more familiar coordinates \((\mu,\sigma^2)\), use the tensor transformation rule. If \(\alpha=(\mu,\sigma^2)\) denotes the new coordinates, then

\[ F^{(\alpha)}_{ab} = \sum_{i,j} F^{(\theta)}_{ij} \frac{\partial \theta^i}{\partial \alpha^a} \frac{\partial \theta^j}{\partial \alpha^b}. \tag{45}\]

Applying this transformation gives

\[ F(\mu,\sigma^2) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 1/(2\sigma^4) \end{pmatrix} \tag{46}\]

This is the correct Fisher metric in \((\mu,\sigma^2)\) coordinates, but it is not obtained as the Hessian of \(A\) with respect to \((\mu,\sigma^2)\) directly, because those are not natural parameters. Geodesic distance in the Fisher metric on the space of Gaussians has a closed-form expression and is well-studied [4].
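The change of coordinates from Eq. (44) to Eq. (46) can be carried out numerically. The sketch below assumes sufficient statistics \(T(x)=(x,x^2)\), so that the Hessian of \(A\) equals the covariance of \((x,x^2)\), written here in terms of \((\mu,\sigma^2)\) for convenience. The variable name `var` stands for \(\sigma^2\).

```python
def hessian_A(mu, var):
    """Hessian of A in natural coordinates, expressed through (mu, var).

    It equals the covariance matrix of the sufficient statistics (x, x^2).
    """
    return [[var, 2.0 * mu * var],
            [2.0 * mu * var, 4.0 * mu**2 * var + 2.0 * var**2]]

def jacobian(mu, var):
    """J[i][a] = d theta^i / d alpha^a, theta = (mu/var, -1/(2 var)), alpha = (mu, var)."""
    return [[1.0 / var, -mu / var**2],
            [0.0, 1.0 / (2.0 * var**2)]]

def fisher_mu_var(mu, var):
    """Apply the tensor rule (45): F_alpha = J^T H J."""
    H, J = hessian_A(mu, var), jacobian(mu, var)
    return [[sum(J[i][a] * H[i][j] * J[j][b] for i in range(2) for j in range(2))
             for b in range(2)] for a in range(2)]

mu, var = 1.5, 0.8
F = fisher_mu_var(mu, var)
print(F)
```

The off-diagonal terms cancel and the diagonal matches \(1/\sigma^2\) and \(1/(2\sigma^4)\), even though the Hessian in natural coordinates is dense and depends on \(\mu\).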

Signpost: dual coordinates. The fact that the Fisher metric on an exponential family is the Hessian of a convex function is deeply connected to the duality structure of information geometry [3]. The convex function \(A\) is a function of the natural-parameter coordinates \(\theta\). Its gradient defines the expectation-parameter coordinates

\[ \eta = \nabla A(\theta). \tag{47}\]

The Legendre transform of \(A\) then exchanges these two coordinate systems:

\[ A^{*}(\eta) = \sup_{\theta} \left\{ \theta^{\top}\eta - A(\theta) \right\}. \tag{48}\]

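In the same exponential-family setting, KL divergence between two members of the family is a Bregman divergence generated by \(A\), with the arguments in swapped order: \(\mathrm{KL}\left(p(\cdot\,;\theta)\,\|\,p(\cdot\,;\theta')\right) = D_A(\theta',\theta)\), where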

\[ D_A(\theta',\theta) = A(\theta') -A(\theta) -\nabla A(\theta)^{\top}(\theta'-\theta). \tag{49}\]

This is only a signpost for now. The important point for this post is that Fisher geometry is not isolated: it sits inside a larger convex-dual structure that will return when we discuss KL divergence, natural gradients, and learning dynamics.
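The Bregman identity in Eq. (49) can be checked against a closed-form KL divergence. The sketch below assumes the Poisson family, with \(A(\theta)=e^{\theta}\) and \(\theta=\log\lambda\), and compares \(D_A(\theta',\theta)\) with \(\mathrm{KL}(p(\cdot\,;\theta)\,\|\,p(\cdot\,;\theta'))\); the two rates are arbitrary.

```python
import math

def bregman_A(theta_new, theta_old):
    """Bregman divergence D_A(theta', theta) for A(theta) = exp(theta) (Poisson)."""
    grad_A_old = math.exp(theta_old)   # gradient of A at the base point
    return math.exp(theta_new) - math.exp(theta_old) - grad_A_old * (theta_new - theta_old)

def kl_poisson(rate_p, rate_q):
    """Closed-form KL( Poisson(rate_p) || Poisson(rate_q) )."""
    return rate_p * math.log(rate_p / rate_q) - rate_p + rate_q

rate, rate_prime = 2.0, 3.5
theta, theta_prime = math.log(rate), math.log(rate_prime)

bregman = bregman_A(theta_prime, theta)
kl = kl_poisson(rate, rate_prime)      # the base point theta is the first KL argument
print(bregman, kl)
```

Note the order: the point where the gradient of \(A\) is taken corresponds to the first argument of KL.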

9 A First Uniqueness Result

I close this post by returning to Chentsov’s theorem, because it gives Fisher geometry a special status.

The Fisher metric is not just a convenient way to measure local distinguishability. Given one natural requirement—that the geometry of a statistical model should not depend on how the data are represented, as long as no statistical information is lost—the Fisher-Rao metric is forced. Up to a constant scale factor, it is the only Riemannian metric with this invariance property.

That is the main lesson of this first ruler. Fisher information does not merely assign numbers to estimators or appear in asymptotic statistics. It defines the unique local geometry compatible with statistical distinguishability. If we care about how distributions change as models learn, this is the geometry that the model carries.

But this uniqueness also explains Fisher’s limitation. Because it is invariant under sufficient transformations of the sample space, Fisher geometry is blind to the spatial arrangement of the sample space itself. It knows how probability weights change; it does not know how far mass has moved.

This is exactly where the second ruler enters. The next post turns to Wasserstein geometry, where distance is not measured by distinguishability but by displacement. Brenier’s theorem [9] will give a second kind of uniqueness: under quadratic transport cost, the optimal way to move mass is not arbitrary, but generated by the gradient of a convex potential.

10 What Comes Next

This post introduced the first ruler: Fisher information, the local geometry of statistical distinguishability. The next post introduces the second ruler: Wasserstein distance, the global geometry of displacement.

Together, these two rulers set up the main tension of the series. Fisher asks how distributions differ statistically; Wasserstein asks how probability mass moves spatially. Learning needs both perspectives.

From there, the story becomes more interesting. KL divergence will appear as the asymmetric object underlying Fisher geometry. Schrödinger bridges will connect entropy and transport into a single stochastic path. Hodge decomposition will reveal the algebraic anatomy of optimal flows. Natural gradient will show what it means for optimization to respect statistical geometry. Langevin dynamics will then add calibrated noise, turning gradient descent from a mode-seeking optimizer into a sampler over distributions.

Finally, these ideas will reappear in modern machine learning, where diffusion models, flow matching, natural gradients, Langevin samplers, and Bayesian methods can all be understood as approximations to pieces of this geometry.

The goal of this series is not to argue that these subjects are secretly the same theory. They are not. Rather, the claim is that they discovered compatible pieces of the same geometric picture: information provides the local ruler, transport provides the spatial ruler, KL provides direction, entropy provides stochastic paths, Langevin connects optimization to sampling, and topology explains the obstructions.

༺ The End ༻

References

[1] R. A. Fisher. “Theory of statistical estimation.” Mathematical Proceedings of the Cambridge Philosophical Society, 22(5), pp. 700–725, 1925.

[2] C. R. Rao. “Information and the accuracy attainable in the estimation of statistical parameters.” Bulletin of the Calcutta Mathematical Society, 37, pp. 81–91, 1945.

[3] S. Amari and H. Nagaoka. Methods of information geometry. American Mathematical Society, 2000.

[4] S. Amari. Information geometry and its applications. Springer, 2016.

[5] N. N. Chentsov. Statistical decision rules and optimal inference. American Mathematical Society, 1982.

[6] M. J. Wainwright and M. I. Jordan. “Graphical models, exponential families, and variational inference.” Foundations and Trends in Machine Learning, 1(1–2), pp. 1–305, 2008.

[7] M. Arjovsky, S. Chintala, and L. Bottou. “Wasserstein generative adversarial networks.” Proceedings of the 34th International Conference on Machine Learning, pp. 214–223, 2017.

[8] S. Amari. “Natural gradient works efficiently in learning.” Neural Computation, 10(2), pp. 251–276, 1998.

[9] Y. Brenier. “Polar factorization and monotone rearrangement of vector-valued functions.” Communications on Pure and Applied Mathematics, 44(4), pp. 375–417, 1991.