<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Gaurav Khanal</title>
<link>https://grvkhnl.github.io/blog.html</link>
<atom:link href="https://grvkhnl.github.io/blog.xml" rel="self" type="application/rss+xml"/>
<description>Essays, notes, and technical writing.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Fri, 05 Jun 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>The First Ruler of Distinguishability</title>
  <dc:creator>Gaurav Khanal</dc:creator>
  <link>https://grvkhnl.github.io/posts/the-first-ruler-of-distinguishability/</link>
  <description><![CDATA[ 





<section id="the-question" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="the-question"><span class="header-section-number">1</span> The Question</h2>
<p>Suppose we have two probability distributions, <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?q">, defined over the same space. How far apart are they?</p>
<p>This is not a rhetorical question. Different answers lead to different algorithms, different criteria for what it means to learn well, and fundamentally different pictures of what the probability space looks like. The choice of distance is a geometric choice. And geometry, as we will see, has consequences that reach all the way into how neural networks are trained.</p>
<p>The naïve answer (subtract the distributions and take a norm) immediately runs into trouble. The <img src="https://latex.codecogs.com/png.latex?L%5E2"> norm,</p>
<p><span id="eq-l2-norm"><img src="https://latex.codecogs.com/png.latex?%0A%5C%7Cp-q%5C%7C_%7BL%5E2%7D%0A=%0A%5Csqrt%7B%20%5Cint%20%5Cleft(%20p(x)-q(x)%20%5Cright)%5E2%5C,dx%7D%0A%5Ctag%7B1%7D"></span></p>
<p>is indifferent to where the points of the sample space <img src="https://latex.codecogs.com/png.latex?X"> sit relative to each other. Consider a narrow Gaussian concentrated near pixel <img src="https://latex.codecogs.com/png.latex?i"> of an image. Shifting it one pixel to the right produces a distribution that is nearly identical to the original in any intuitive sense. But when the distributions are narrow and have little overlap, the <img src="https://latex.codecogs.com/png.latex?L%5E2"> norm can treat a small spatial shift almost as harshly as a much larger shift, because it sees pointwise mismatch rather than displacement cost.</p>
<p>This example is deliberately extreme. For broader distributions, <img src="https://latex.codecogs.com/png.latex?L%5E2"> certainly responds to overlap. The deeper problem is not that <img src="https://latex.codecogs.com/png.latex?L%5E2"> is always insensitive, but that it is not transport-aware: it penalizes pointwise differences regardless of whether probability mass moved a little or a lot. The norm sees only differences in probability mass at each location; it has no vocabulary for the cost of moving that mass through space. For most problems of interest, this is exactly the missing structure.</p>
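<p>To make the pixel-shift example concrete, here is a minimal sketch, assuming two one-dimensional Gaussians of equal width <code>sigma</code> whose means differ by <code>shift</code>. The function name and the chosen numbers are illustrative assumptions, not part of the original argument. For this special case the squared L2 distance has a closed form, so no numerical integration is needed.</p>

```python
import math

def gaussian_l2_distance(shift, sigma):
    """L2 distance between the densities of N(0, sigma^2) and N(shift, sigma^2).

    Uses the closed form
        ||p - q||^2 = (1 - exp(-shift^2 / (4 sigma^2))) / (sigma * sqrt(pi)).
    """
    overlap = math.exp(-shift**2 / (4.0 * sigma**2))
    return math.sqrt((1.0 - overlap) / (sigma * math.sqrt(math.pi)))

# Narrow Gaussians: almost no overlap once the shift exceeds a few sigma.
narrow_small = gaussian_l2_distance(shift=1.0, sigma=0.1)
narrow_large = gaussian_l2_distance(shift=10.0, sigma=0.1)

# Broad Gaussians: the distance still grows with the shift.
broad_small = gaussian_l2_distance(shift=1.0, sigma=3.0)
broad_large = gaussian_l2_distance(shift=10.0, sigma=3.0)

print(narrow_small, narrow_large)  # nearly identical values
print(broad_small, broad_large)    # clearly different values
```

<p>With <code>sigma = 0.1</code> the two distances are practically indistinguishable even though one shift is ten times larger. With <code>sigma = 3.0</code> the densities overlap, and the larger shift is penalized noticeably more.</p>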
<p>A better answer requires first deciding what structure the distance should respect. There are two natural answers, and they come from completely different directions. This post discusses the first.</p>
</section>
<section id="distinguishability-as-geometry" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="distinguishability-as-geometry"><span class="header-section-number">2</span> Distinguishability as Geometry</h2>
<p>The first answer starts from a statistical question: <em>how well can you tell two distributions apart from data?</em></p>
<p>If <img src="https://latex.codecogs.com/png.latex?p"> and <img src="https://latex.codecogs.com/png.latex?q"> are very close, distinguishing them requires many samples; if they differ substantially, even a small sample may suffice. This notion of statistical distinguishability is the basis of Fisher’s approach <span class="citation" data-cites="fisher1925">[1]</span>.</p>
<p>The key object is the <em>score function</em>, which is the gradient of the log-likelihood with respect to the parameter. For a parametric family <img src="https://latex.codecogs.com/png.latex?p(x;%5Ctheta)">, the score at <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is:</p>
<p><span id="eq-score-function"><img src="https://latex.codecogs.com/png.latex?%0As(x;%5Ctheta)%0A=%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7D%0A%5Clog%20p(x;%5Ctheta).%0A%5Ctag%7B2%7D"></span></p>
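<p>The score measures, infinitesimally, how the log-density changes as you move in parameter space. It is the velocity of the distribution in the direction of <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, expressed as a relative rate of change of the density at each point <img src="https://latex.codecogs.com/png.latex?x">.</p>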
<p>To quantify how large this velocity is, that is, how much the distribution changes per unit movement in <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, Fisher took the expected squared magnitude of the score. For a scalar parameter, this gives the <em>Fisher information</em> <span class="citation" data-cites="fisher1925 rao1945">[1, 2]</span>:</p>
<p><span id="eq-fisher-information"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathcal%20I(%5Ctheta)%0A=%0A%5Cmathbb%20E_%5Ctheta%0A%5Cleft%5B%0A%20%20%5Cleft(%0A%20%20%20%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7D%0A%20%20%20%20%5Clog%20p(x;%5Ctheta)%0A%20%20%5Cright)%5E2%0A%5Cright%5D.%0A%5Ctag%7B3%7D"></span></p>
<p>In higher dimensions, this becomes the Fisher Information Matrix (FIM):</p>
<p><span id="eq-fisher-information-matrix"><img src="https://latex.codecogs.com/png.latex?%0AF_%7Bij%7D(%5Ctheta)%0A=%0A%5Cmathbb%20E_%5Ctheta%0A%5Cleft%5B%0A%20%20%5Cpartial_i%20%5Clog%20p(x;%5Ctheta)%0A%20%20%5C,%0A%20%20%5Cpartial_j%20%5Clog%20p(x;%5Ctheta)%0A%5Cright%5D.%0A%5Ctag%7B4%7D"></span></p>
<div class="precision-box callout callout-style-default callout-note callout-titled" title="Notation">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Notation
</div>
</div>
<div class="callout-body-container callout-body">
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%20E_%5Ctheta%5B%5Ccdot%5D"> means expectation with respect to <img src="https://latex.codecogs.com/png.latex?x%5Csim%20p(%5Ccdot%5C,;%5Ctheta)">. The symbol <img src="https://latex.codecogs.com/png.latex?%5Cpartial_i"> means differentiation with respect to the <img src="https://latex.codecogs.com/png.latex?i">th parameter coordinate, <img src="https://latex.codecogs.com/png.latex?%5Cpartial/%5Cpartial%5Ctheta%5Ei">. Later, when expressions such as <img src="https://latex.codecogs.com/png.latex?F_%7Bij%7D(%5Ctheta)%5Cdelta%5Ei%5Cdelta%5Ej"> appear without an explicit summation sign, they use the standard repeated-index convention:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AF_%7Bij%7D(%5Ctheta)%5Cdelta%5Ei%5Cdelta%5Ej%0A=%0A%5Csum_%7Bi,j%7DF_%7Bij%7D(%5Ctheta)%5Cdelta%5Ei%5Cdelta%5Ej.%0A"></p>
<p>Finally, <img src="https://latex.codecogs.com/png.latex?T_%5Ctheta%5Cmathcal%20M"> denotes the tangent space to the parameter manifold at <img src="https://latex.codecogs.com/png.latex?%5Ctheta">: the vector space of infinitesimal parameter directions.</p>
</div>
</div>
<p>This matrix is not merely a useful summary of the model. It is a <em>Riemannian metric tensor</em> on the parameter manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D">: a smoothly varying inner product on the tangent spaces <img src="https://latex.codecogs.com/png.latex?T_%7B%5Ctheta%7D%5Cmathcal%7BM%7D">, encoding the infinitesimal statistical distinguishability of nearby distributions. The systematic study of this structure is the subject of information geometry <span class="citation" data-cites="amari2000 amari2016">[3, 4]</span>.</p>
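<p>As a small worked example of the matrix definition, the sketch below computes the FIM exactly for a categorical distribution parameterized by softmax logits. The model choice and the function names are assumptions made here for illustration. Because the sample space is finite, the expectation is a plain sum over outcomes, and the score with respect to logit <code>i</code> at outcome <code>x</code> is <code>1[x == i] - p[i]</code>.</p>

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_fim(logits):
    """F_ij = E[ d_i log p(x) * d_j log p(x) ] for softmax-parameterized outcomes."""
    p = softmax(logits)
    n = len(p)
    fim = [[0.0] * n for _ in range(n)]
    for x in range(n):                       # expectation over outcomes x
        score = [(1.0 if i == x else 0.0) - p[i] for i in range(n)]
        for i in range(n):
            for j in range(n):
                fim[i][j] += p[x] * score[i] * score[j]
    return fim

logits = [0.5, -1.0, 2.0]
for row in categorical_fim(logits):
    print(row)
```

<p>The result matches the known closed form <code>diag(p) - p p^T</code>. Each row sums to zero, because adding the same constant to every logit leaves the distribution unchanged. In this redundant parameterization the FIM is therefore only positive semidefinite.</p>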
</section>
<section id="the-fim-as-a-pullback-metric" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="the-fim-as-a-pullback-metric"><span class="header-section-number">3</span> The FIM as a Pullback Metric</h2>
<p>The FIM has a clean geometric origin and is not an <em>ad hoc</em> construction. It is the metric that parameter space <em>inherits</em> from the space of distributions via the parametric family. This is the idea of a <em>pullback metric</em>.</p>
<div class="math-depth callout callout-style-default callout-tip callout-titled" title="Mathematical Depth">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Mathematical Depth
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><em>Pullback metrics.</em></strong> Given a smooth map <img src="https://latex.codecogs.com/png.latex?f:M%20%5Cto%20N"> between manifolds and a Riemannian metric <img src="https://latex.codecogs.com/png.latex?g"> on <img src="https://latex.codecogs.com/png.latex?N">, the pullback metric <img src="https://latex.codecogs.com/png.latex?f%5E%7B*%7Dg"> on <img src="https://latex.codecogs.com/png.latex?M"> is defined by:</p>
<p><span id="eq-pullback-metric"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft(f%5E%7B*%7Dg%20%5Cright)(u,v)%0A=%0Ag_%7Bf(p)%7D%5Cleft(df_p(u),df_p(v)%5Cright)%0A%5Ctag%7B5%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?df_p:%20T_p%20M%20%5Cto%20T_%7Bf(p)%7D%20N"> is the differential of <img src="https://latex.codecogs.com/png.latex?f"> at <img src="https://latex.codecogs.com/png.latex?p">, and <img src="https://latex.codecogs.com/png.latex?u,v%20%5Cin%20T_pM"> are tangent vectors. Intuitively, to measure vectors in <img src="https://latex.codecogs.com/png.latex?M">, push them forward to <img src="https://latex.codecogs.com/png.latex?N"> using <img src="https://latex.codecogs.com/png.latex?df"> and measure them using <img src="https://latex.codecogs.com/png.latex?g">. For <img src="https://latex.codecogs.com/png.latex?f%5E%7B*%7Dg"> to be a genuine metric (positive definite), <img src="https://latex.codecogs.com/png.latex?f"> must be an immersion, i.e., its differential must be injective at each point.</p>
<p><strong><em>The FIM as a pullback.</em></strong> Let <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BP%7D"> be the space of probability distributions, equipped with the <img src="https://latex.codecogs.com/png.latex?L%5E2"> inner product on score functions: <img src="https://latex.codecogs.com/png.latex?%5Clangle%20u,v%20%5Crangle%20=%20%5Cmathbb%7BE%7D_p%5Bu%20%5Ccdot%20v%5D">, where <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%20E_p"> means expectation under the distribution <img src="https://latex.codecogs.com/png.latex?p">.</p>
<p>In the general pullback notation above, the parameter manifold <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%20M"> plays the role of <img src="https://latex.codecogs.com/png.latex?M">, and the probability space <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%20P"> plays the role of <img src="https://latex.codecogs.com/png.latex?N">. The parametric family defines the model map</p>
<p><span id="eq-parametric-family-map"><img src="https://latex.codecogs.com/png.latex?%0Af:%5Cmathcal%7BM%7D%20%5Cto%20%5Cmathcal%7BP%7D,%20%5Cquad%20%5Ctheta%20%5Cmapsto%20p(%5Ccdot%5C,;%5Ctheta).%0A%5Ctag%7B6%7D"></span></p>
<p>Its differential sends parameter directions to score functions:</p>
<p><span id="eq-parametric-family-differential-score"><img src="https://latex.codecogs.com/png.latex?%0Adf_%5Ctheta(%5Cpartial_i)%0A=%0A%5Cpartial_i%20%5Clog%20p(%5Ccdot%5C,;%5Ctheta).%0A%5Ctag%7B7%7D"></span></p>
<p>This equation is the bridge between parameter geometry and distribution geometry. A coordinate direction <img src="https://latex.codecogs.com/png.latex?%5Cpartial_i"> in parameter space becomes a score function on the sample space. The pullback of the <img src="https://latex.codecogs.com/png.latex?L%5E2"> inner product is then</p>
<p><span id="eq-pullback-l2-inner-product"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft(f%5E%7B*%7Dg%5Cright)_%7Bij%7D%0A=%0A%5Clangle%20df(%5Cpartial_i),df(%5Cpartial_j)%5Crangle%0A=%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5Cleft%5B%5Cpartial_i%20%5Clog%20p%20%5Ccdot%20%5Cpartial_j%20%5Clog%20p%20%5Cright%5D%0A=%0AF_%7Bij%7D(%5Ctheta)%0A%5Ctag%7B8%7D"></span></p>
<p>The FIM is the metric that <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BM%7D"> inherits from <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BP%7D">. Geometry flows <em>downward</em> from the space of distributions to the parameter space.</p>
<p><strong><em>Coordinate invariance.</em></strong> The entries of the Fisher Information Matrix depend on the coordinates used to describe the parameter space. But the underlying geometric object does not.</p>
<p>Mathematically, the FIM is a <img src="https://latex.codecogs.com/png.latex?(0,2)">-tensor: at each point <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, it takes two tangent vectors <img src="https://latex.codecogs.com/png.latex?u,v%20%5Cin%20T_%5Ctheta%5Cmathcal%20M"> and returns a scalar,</p>
<p><span id="eq-fisher-bilinear-form"><img src="https://latex.codecogs.com/png.latex?%0AF_%5Ctheta(u,v)%0A=%0A%5Csum_%7Bi,j%7D%0AF_%7Bij%7D(%5Ctheta)u%5Ei%20v%5Ej.%0A%5Ctag%7B9%7D"></span></p>
<p>In other words, the Fisher matrix is the coordinate representation of an intrinsic inner product on parameter directions. The matrix entries may change when we change coordinates, but the scalar quantity <img src="https://latex.codecogs.com/png.latex?F_%5Ctheta(u,v)"> does not.</p>
<p>Under a reparameterization <img src="https://latex.codecogs.com/png.latex?%5Ctilde%7B%5Ctheta%7D=%5Cphi(%5Ctheta)">, the FIM transforms as a <img src="https://latex.codecogs.com/png.latex?(0,2)">-tensor:</p>
<p><span id="eq-reparameterized-fim"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%20F_%7Bkl%7D%0A=%0A%5Csum_%7Bi,j%7D%0AF_%7Bij%7D%0A%5Cfrac%7B%5Cpartial%20%5Ctheta%5Ei%7D%7B%5Cpartial%20%5Ctilde%5Ctheta%5Ek%7D%0A%5Cfrac%7B%5Cpartial%20%5Ctheta%5Ej%7D%7B%5Cpartial%20%5Ctilde%5Ctheta%5El%7D.%0A%5Ctag%7B10%7D"></span></p>
<p>This is exactly the transformation law for a Riemannian metric. The FIM captures intrinsic geometry, not a coordinate artifact. Changing how we label the parameters changes the matrix entries, but not the statistical distance measured by the metric.</p>
</div>
</div>
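<p>The transformation law can be checked by hand on the simplest possible model. The sketch below assumes a Bernoulli distribution described in two coordinate systems, the mean <code>p</code> and the logit <code>eta = log(p / (1 - p))</code>. All names are illustrative. It computes the Fisher information directly from the definition in each chart, then compares against the tensor transformation rule and checks that the squared length of a tangent vector is the same in both.</p>

```python
import math

def fisher_mean_coords(p):
    """Fisher information of Bernoulli(p) with respect to p itself."""
    # Score is 1/p when x = 1 and -1/(1 - p) when x = 0.
    return p * (1.0 / p) ** 2 + (1.0 - p) * (-1.0 / (1.0 - p)) ** 2

def fisher_logit_coords(eta):
    """Fisher information of the same model with respect to the logit eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    # Score with respect to eta is x - p.
    return p * (1.0 - p) ** 2 + (1.0 - p) * (0.0 - p) ** 2

p = 0.2
eta = math.log(p / (1.0 - p))
dp_deta = p * (1.0 - p)              # Jacobian of the old coordinate in the new one

# Transformation law in one dimension: F_tilde = F * (dp/deta)^2.
transformed = fisher_mean_coords(p) * dp_deta ** 2

# The same tangent vector has different components in the two charts,
# but its squared Fisher length agrees.
u_p = 0.05                           # component in p-coordinates
u_eta = u_p / dp_deta                # component in eta-coordinates
length_p = fisher_mean_coords(p) * u_p ** 2
length_eta = fisher_logit_coords(eta) * u_eta ** 2

print(fisher_logit_coords(eta), transformed)
print(length_p, length_eta)
```

<p>The matrix entry changes from <code>1 / (p (1 - p))</code> to <code>p (1 - p)</code>, yet the two printed lengths coincide. That is the sense in which the metric, rather than its coordinate expression, is the intrinsic object.</p>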
</section>
<section id="chentsovs-theorem-the-unique-invariant-metric" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="chentsovs-theorem-the-unique-invariant-metric"><span class="header-section-number">4</span> Chentsov’s Theorem: The Unique Invariant Metric</h2>
<p>The pullback construction shows that the FIM is natural. A deeper result, Chentsov’s theorem <span class="citation" data-cites="chentsov1982">[5]</span>, shows it is unique.</p>
<div class="math-depth callout callout-style-default callout-tip callout-titled" title="Mathematical Depth">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Mathematical Depth
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><em>Chentsov’s Theorem.</em></strong></p>
<p><strong>Informal Version.</strong> The Fisher-Rao metric is the unique Riemannian metric on the statistical manifold, up to a positive scalar multiple, that is invariant under statistically sufficient transformations of the observations.</p>
<p>Equivalently: if a transformation of the data preserves all information relevant to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, then it should preserve the geometry of the model. Fisher-Rao is the only Riemannian metric with this property.</p>
<p><strong>A more formal version (finite-dimensional).</strong> Let</p>
<p><span id="eq-simplex-interior"><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_n%0A=%0A%5Cleft%5C%7B%0Ap=(p_1,%5Cldots,p_n):%20p_i%3E0,%5C%20%5Csum_%7Bi=1%7D%5En%20p_i=1%0A%5Cright%5C%7D%0A%5Ctag%7B11%7D"></span></p>
<p>be the interior of the probability simplex. This is the finite/discrete setting: points of <img src="https://latex.codecogs.com/png.latex?%5CDelta_n"> are categorical probability distributions over <img src="https://latex.codecogs.com/png.latex?n"> outcomes. Suppose that for each <img src="https://latex.codecogs.com/png.latex?n">, we assign a Riemannian metric <img src="https://latex.codecogs.com/png.latex?g%5E%7B(n)%7D"> on <img src="https://latex.codecogs.com/png.latex?%5CDelta_n">. The assignment is called <em>monotone under Markov maps</em> if, for every stochastic map <img src="https://latex.codecogs.com/png.latex?T:%5CDelta_n%5Cto%5CDelta_m"> (equivalently, a Markov transition matrix),</p>
<p><span id="eq-markov-monotonicity"><img src="https://latex.codecogs.com/png.latex?%0Ag%5E%7B(m)%7D_%7BTp%7D(T_*u,T_*u)%0A%5Cleq%0Ag%5E%7B(n)%7D_p(u,u),%0A%5Ctag%7B12%7D"></span></p>
<p>for every tangent vector <img src="https://latex.codecogs.com/png.latex?u%5Cin%20T_p%5CDelta_n">. In words: applying a noisy data-processing map cannot increase statistical distinguishability.</p>
<p>Here <img src="https://latex.codecogs.com/png.latex?T_*u"> denotes the pushforward of the tangent vector <img src="https://latex.codecogs.com/png.latex?u"> by the map <img src="https://latex.codecogs.com/png.latex?T">: it is the infinitesimal direction obtained after applying the data-processing map.</p>
<p>Chentsov’s theorem says that, up to an overall positive constant, the only such monotone Riemannian metric is the Fisher-Rao metric:</p>
<p><span id="eq-fisher-rao-simplex-metric"><img src="https://latex.codecogs.com/png.latex?%0Ag_p(u,v)%0A=%0Ac%5Csum_%7Bi=1%7D%5En%20%5Cfrac%7Bu_i%20v_i%7D%7Bp_i%7D,%0A%5Cqquad%20c%3E0.%0A%5Ctag%7B13%7D"></span></p>
<p>Equivalently, in a parametric model <img src="https://latex.codecogs.com/png.latex?p(x;%5Ctheta)">, this metric (with <img src="https://latex.codecogs.com/png.latex?c=1">) pulls back to the Fisher information matrix</p>
<p><span id="eq-fisher-rao-parametric-pullback"><img src="https://latex.codecogs.com/png.latex?%0AF_%7Bij%7D(%5Ctheta)%0A=%0A%5Cmathbb%20E_%5Ctheta%0A%5Cleft%5B%0A%5Cpartial_i%5Clog%20p(x;%5Ctheta)%5C,%0A%5Cpartial_j%5Clog%20p(x;%5Ctheta)%0A%5Cright%5D.%0A%5Ctag%7B14%7D"></span></p>
<p>Thus Fisher geometry is not merely invariant under reparameterizing <img src="https://latex.codecogs.com/png.latex?%5Ctheta">; it is also the unique geometry that contracts under statistically noisy transformations of the observations.</p>
</div>
</div>
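<p>The monotonicity inequality is easy to probe numerically. The following sketch assumes the Fisher-Rao metric on the simplex with constant <code>c = 1</code> and draws random simplex points, random tangent vectors, and random column-stochastic matrices. The helper names, sizes, and seed are illustrative choices. It is a sanity check on random instances, not a proof.</p>

```python
import random

def fisher_rao_sq_norm(p, u):
    """Squared Fisher-Rao length of tangent vector u at p, with c = 1."""
    return sum(ui * ui / pi for ui, pi in zip(u, p))

def apply_markov(T, v):
    """Apply a column-stochastic matrix T (m rows, n columns) to a vector v."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in T]

def random_simplex_point(n, rng):
    """Random point in the interior of the simplex."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def random_markov_matrix(m, n, rng):
    """Each of the n columns is a probability vector over m outputs."""
    columns = [random_simplex_point(m, rng) for _ in range(n)]
    return [[columns[i][j] for i in range(n)] for j in range(m)]

def worst_case_ratio(trials=1000, n=5, m=3, seed=0):
    """Largest observed ratio g(T u, T u) / g(u, u) over random trials."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        p = random_simplex_point(n, rng)
        raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        mean = sum(raw) / n
        u = [x - mean for x in raw]          # tangent vectors sum to zero
        T = random_markov_matrix(m, n, rng)
        before = fisher_rao_sq_norm(p, u)
        after = fisher_rao_sq_norm(apply_markov(T, p), apply_markov(T, u))
        worst = max(worst, after / before)
    return worst

print(worst_case_ratio())  # stays at or below 1, as monotonicity requires
```

<p>Replacing the random matrix with a permutation matrix, which merely relabels outcomes and loses no information, gives a ratio of exactly one. This is the invariance half of the statement.</p>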
<p>Chentsov is not saying the FIM is a natural choice among many. He is saying it is the <em>only</em> choice consistent with the principle that geometry should not depend on how data is represented. If two experimenters measure the same phenomenon but record their observations differently—and neither loses information—their geometric picture of the statistical model must agree. The FIM is the unique metric with this property.</p>
<p>This is a strong and somewhat surprising result. It says the geometry of statistical inference is not a convention or a convenience: it is forced by the structure of the problem.</p>
<p>This uniqueness result is local. The Fisher metric is an infinitesimal object: it is a tensor at each point of parameter space, giving an inner product on tangent vectors. It measures how fast nearby distributions diverge statistically, but it says nothing directly about distributions that are far apart.</p>
<p>Recovering a global distance requires minimizing path length, measured in the Fisher metric, over all curves joining the two distributions. The minimizers are geodesics, and finding them is generally difficult. For a few families (such as univariate Gaussians and categorical distributions), geodesics can be computed explicitly. In general, though, they cannot. This locality is a strength (it gives a clean Riemannian theory that is well-defined and coordinate-free) and a genuine limitation (it is not a ready-made global distance between arbitrary distributions).</p>
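<p>Categorical distributions are one case where the global distance is explicit. The sketch below relies on a standard fact that the text above does not derive: the square-root map sends the simplex onto part of a sphere, and the Fisher-Rao distance (with <code>c = 1</code>) becomes <code>2 * arccos(sum_i sqrt(p_i * q_i))</code>. For comparison it also integrates the Fisher length of the straight mixture path between the same endpoints. Function names and the example points are illustrative.</p>

```python
import math

def fisher_rao_distance(p, q):
    """Closed-form Fisher-Rao distance on the simplex (metric constant c = 1)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))     # clamp guards against rounding

def straight_line_length(p, q, steps=20000):
    """Fisher length of the mixture path (1 - t) p + t q, by the midpoint rule."""
    diff = [qi - pi for pi, qi in zip(p, q)]
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [pi + t * di for pi, di in zip(p, diff)]
        speed_sq = sum(di * di / xi for di, xi in zip(diff, point))
        total += math.sqrt(speed_sq) / steps
    return total

p = [0.8, 0.1, 0.1]
q = [0.1, 0.1, 0.8]
print(fisher_rao_distance(p, q))    # geodesic distance
print(straight_line_length(p, q))   # longer: the mixture path is not a geodesic
```

<p>The length of a curve depends on the curve; the distance is the smallest such length. The straight line in probability coordinates is a natural guess, and it overshoots.</p>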
<p>This is the first major contrast with Wasserstein geometry. Fisher gives a local ruler for statistical distinguishability; Wasserstein will give a global ruler for displacement through the sample space.</p>
</section>
<section id="the-cramér-rao-bound-geometry-in-action" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="the-cramér-rao-bound-geometry-in-action"><span class="header-section-number">5</span> The Cramér-Rao Bound: Geometry in Action</h2>
<p>The Fisher metric does more than organize the geometry of a statistical model. It places hard limits on what any learning or estimation procedure can achieve. Where the Fisher metric is small in some direction, a learner cannot acquire information about that direction quickly, no matter how its update rule is designed. The Cramér-Rao bound makes this precise <span class="citation" data-cites="fisher1925 rao1945">[1, 2]</span>.</p>
<p>For an unbiased estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> of a scalar parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, the bound states:</p>
<p><span id="eq-cramer-rao-bound"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%0A%5Cgeq%0A%5Cfrac%7B1%7D%7B%5Cmathcal%7BI%7D(%5Ctheta)%7D%0A%5Ctag%7B15%7D"></span></p>
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BVar%7D_%5Ctheta"> denotes variance under <img src="https://latex.codecogs.com/png.latex?p(%5Ccdot%5C,;%5Ctheta)">.</p>
<p>In vector form, for an unbiased estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> of <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D%5En">:</p>
<p><span id="eq-cramer-rao-bound-vector"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%0A%5Csucceq%0AF(%5Ctheta)%5E%7B-1%7D%0A%5Ctag%7B16%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Coperatorname%7BCov%7D_%5Ctheta"> denotes covariance under <img src="https://latex.codecogs.com/png.latex?p(%5Ccdot%5C,;%5Ctheta)">, and <img src="https://latex.codecogs.com/png.latex?%5Csucceq"> denotes the Loewner partial order on symmetric matrices: <img src="https://latex.codecogs.com/png.latex?A%5Csucceq%20B"> means <img src="https://latex.codecogs.com/png.latex?A-B"> is positive semidefinite. In this ordering, no unbiased estimator can have covariance smaller than <img src="https://latex.codecogs.com/png.latex?F(%5Ctheta)%5E%7B-1%7D">.</p>
<p>For deep learning, this statement should not be read too literally. We are usually not searching for an unbiased estimator of a true parameter vector; neural networks are overparameterized, non-identifiable, and trained for predictive performance rather than classical parameter recovery. The relevance is geometric: the Fisher matrix describes how sensitive the model distribution is to parameter movement. In optimization language, its inverse acts like a curvature-aware preconditioner, shaping stable step sizes and update directions.</p>
<p>The geometric reading is direct: <img src="https://latex.codecogs.com/png.latex?F(%5Ctheta)%5E%7B-1%7D"> is the dual metric on the cotangent space, and the bound says that estimation uncertainty is bounded below by the inverse of the local metric scale of the statistical manifold. Where the Fisher metric is large in a direction—where distributions are highly distinguishable and change rapidly along that direction—estimation is easy. Conversely, where the local distinguishability is small, estimation is hard.</p>
<div class="math-depth callout callout-style-default callout-tip callout-titled" title="Mathematical Depth">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Mathematical Depth
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><em>Proof sketch.</em></strong> Fix an unbiased estimator <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> of a scalar parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cin%20%5Cmathbb%7BR%7D">. Unbiasedness means</p>
<p><span id="eq-unbiased-estimator-identity"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5B%5Chat%7B%5Ctheta%7D(x)%5D%0A=%0A%5Cint%20%5Chat%7B%5Ctheta%7D(x)p(x;%5Ctheta)%5C,dx%0A=%0A%5Ctheta.%0A%5Ctag%7B17%7D"></span></p>
<p>Differentiate both sides with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">:</p>
<p><span id="eq-differentiate-unbiasedness"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7D%0A%5Cint%20%5Chat%7B%5Ctheta%7D(x)p(x;%5Ctheta)%5C,dx%0A=%0A1.%0A%5Ctag%7B18%7D"></span></p>
<p>Exchanging derivative and integral gives</p>
<p><span id="eq-unbiasedness-derivative-integral"><img src="https://latex.codecogs.com/png.latex?%0A%5Cint%20%5Chat%7B%5Ctheta%7D(x)%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7Dp(x;%5Ctheta)%5C,dx%0A=%0A1.%0A%5Ctag%7B19%7D"></span></p>
<p>Now use the score identity</p>
<p><span id="eq-density-derivative-score"><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7Dp(x;%5Ctheta)%0A=%0Ap(x;%5Ctheta)%5C,%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7D%5Clog%20p(x;%5Ctheta)%0A=%0Ap(x;%5Ctheta)s(x;%5Ctheta).%0A%5Ctag%7B20%7D"></span></p>
<p>Substituting this into the previous equation gives</p>
<p><span id="eq-exp-score"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5B%5Chat%7B%5Ctheta%7D(x)s(x;%5Ctheta)%5D=1%0A%5Ctag%7B21%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?s=%5Cpartial_%7B%5Ctheta%7D%5Clog%20p"> is the score. We also use Fisher’s identity:</p>
<p><span id="eq-fisher-identity-score-zero"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5Bs(x;%5Ctheta)%5D%0A=%0A%5Cint%20%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7Dp(x;%5Ctheta)%5C,dx%0A=%0A%5Cfrac%7B%5Cpartial%7D%7B%5Cpartial%20%5Ctheta%7D%5Cint%20p(x;%5Ctheta)%5C,dx%0A=%0A0.%0A%5Ctag%7B22%7D"></span></p>
<p>Because the score has mean zero, the covariance between <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> and <img src="https://latex.codecogs.com/png.latex?s"> is</p>
<p><span id="eq-cramer-rao-covariance-score"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D,s)%0A=%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5B%5Chat%7B%5Ctheta%7Ds%5D%0A-%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5B%5Chat%7B%5Ctheta%7D%5D%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5Bs%5D%0A=%0A1.%0A%5Ctag%7B23%7D"></span></p>
<p>Now apply the Cauchy-Schwarz inequality to this covariance:</p>
<p><span id="eq-cauchy-schwarz-covariance"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D,s)%5E2%0A%5Cleq%0A%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%5Ccdot%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(s)%0A%5Ctag%7B24%7D"></span></p>
<p>The left side is <img src="https://latex.codecogs.com/png.latex?1">, and the variance of the score is the Fisher information:</p>
<p><span id="eq-score-variance-fisher-information"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(s)%0A=%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5Bs%5E2%5D%0A=%0A%5Cmathcal%7BI%7D(%5Ctheta).%0A%5Ctag%7B25%7D"></span></p>
<p>Therefore,</p>
<p><span id="eq-cramer-rao-scalar-intermediate"><img src="https://latex.codecogs.com/png.latex?%0A1%0A%5Cleq%0A%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%5Cmathcal%7BI%7D(%5Ctheta),%0A%5Ctag%7B26%7D"></span></p>
<p>which is equivalent to</p>
<p><span id="eq-cramer-rao-scalar-derived"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BVar%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%0A%5Cgeq%0A%5Cfrac%7B1%7D%7B%5Cmathcal%7BI%7D(%5Ctheta)%7D.%0A%5Ctag%7B27%7D"></span></p>
<p>The multivariate case follows the same logic, but covariance replaces variance and the score becomes a vector. Let</p>
<p><span id="eq-score-vector"><img src="https://latex.codecogs.com/png.latex?%0As(x;%5Ctheta)%0A=%0A%5Cnabla_%5Ctheta%20%5Clog%20p(x;%5Ctheta).%0A%5Ctag%7B28%7D"></span></p>
<p>For an unbiased vector estimator, differentiating <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%5Ctheta%5B%5Chat%7B%5Ctheta%7D%5D=%5Ctheta"> gives</p>
<p><span id="eq-multivariate-score-identity"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%5Ctheta%5B(%5Chat%7B%5Ctheta%7D-%5Ctheta)s%5E%5Ctop%5D=I.%0A%5Ctag%7B29%7D"></span></p>
<p>This is the vector analogue of the scalar covariance identity above. Applying the matrix Cauchy-Schwarz inequality yields</p>
<p><span id="eq-matrix-cramer-rao-proof"><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D_%7B%5Ctheta%7D(%5Chat%7B%5Ctheta%7D)%20%5Csucceq%20F(%5Ctheta)%5E%7B-1%7D.%0A%5Ctag%7B30%7D"></span></p>
<p>One way to read this step is through a Schur-complement argument: the joint covariance block matrix of the estimator error and the score is positive semidefinite, and using <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%5Ctheta%5B(%5Chat%7B%5Ctheta%7D-%5Ctheta)s%5E%5Ctop%5D=I"> forces the estimator covariance block to dominate the inverse Fisher block. This is exactly the matrix version of the scalar inequality above.</p>
</div>
</div>
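<p>A quick exact check of the scalar bound, assuming <code>n</code> independent Bernoulli draws with success probability <code>theta</code>. The sketch uses the standard fact that Fisher information adds over independent observations, so the bound for <code>n</code> draws is <code>1 / (n * I(theta))</code>. The two estimators and all names are illustrative. Variances are computed exactly from the binomial distribution rather than simulated.</p>

```python
import math

def fisher_information(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def sample_mean_variance(theta, n):
    """Exact variance of the sample mean of n i.i.d. Bernoulli(theta) draws."""
    variance = 0.0
    for k in range(n + 1):                   # k = number of ones observed
        prob = math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        variance += prob * (k / n - theta) ** 2
    return variance

def first_draw_variance(theta):
    """Variance of the unbiased estimator that keeps x_1 and ignores the rest."""
    return theta * (1.0 - theta)

theta, n = 0.3, 50
bound = 1.0 / (n * fisher_information(theta))   # Cramér-Rao bound for n draws

print(bound)
print(sample_mean_variance(theta, n))   # attains the bound up to rounding
print(first_draw_variance(theta))       # unbiased, but far above the bound
```

<p>Both estimators are unbiased, so both must respect the bound. The sample mean sits on it, while the estimator that discards all but the first draw pays for the wasted information with a variance <code>n</code> times larger.</p>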
<p>The Cramér-Rao bound is a classical result in mathematical statistics, but its geometric interpretation as a statement about the inverse of a Riemannian metric was not fully articulated until Rao’s formulation <span class="citation" data-cites="rao1945">[2]</span> and the subsequent development of information geometry <span class="citation" data-cites="amari2000">[3]</span>.</p>
</section>
<section id="what-the-fisher-metric-cannot-see" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="what-the-fisher-metric-cannot-see"><span class="header-section-number">6</span> What the Fisher Metric Cannot See</h2>
<p>The Fisher metric is defined entirely within parameter space. It requires a parametric family <img src="https://latex.codecogs.com/png.latex?p(x;%5Ctheta)"> and measures how the model changes as <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> varies. It is sensitive to the statistical shape of the model—how distributions in the family differ from each other—and entirely indifferent to the geometry of the sample space <img src="https://latex.codecogs.com/png.latex?X"> itself.</p>
<p>If you relabel the points of <img src="https://latex.codecogs.com/png.latex?X"> arbitrarily, i.e., apply any measurable bijection to the data, the Fisher metric does not change. This is exactly Chentsov’s invariance property that we touched on earlier, and it is both a strength and a limitation.</p>
<p>It is a strength because it means the geometry is intrinsic to the statistical model, not an artifact of data representation. It is a limitation because it means the Fisher metric has no vocabulary for the spatial layout of the data. To visualize the distinction, imagine a density plot: the horizontal axis is the sample space, while the vertical axis records how much probability mass sits at each location. Fisher sees changes in the heights of that plot, but not the geometry along the horizontal axis. It cannot tell you that the image of a cat shifted two pixels to the right is closer to the original than a completely different image. It lives entirely in the vertical direction, how probability weights change, and is blind to the horizontal direction of where points live in space.</p>
<div class="learning-box callout callout-style-default callout-note callout-titled" title="Consequence for Learning">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Consequence for Learning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Learning algorithms that minimize <em>Kullback-Leibler (KL) divergence</em> (maximum likelihood estimation, variational autoencoders, and most of classical statistical inference) are navigating Fisher geometry locally <span class="citation" data-cites="amari2000 wainwright2008">[3, 6]</span>. This is not merely an analogy; a future post will make the connection precise. For now, the key point is that KL-based objectives are sensitive to the parametric structure of the model and indifferent to the spatial layout of the data.</p>
<p>A related failure appears for <img src="https://latex.codecogs.com/png.latex?f">-divergences such as Jensen-Shannon in early GANs: when supports are disjoint or nearly disjoint, the divergence can saturate and provide weak gradients, motivating Wasserstein-based objectives <span class="citation" data-cites="arjovsky2017">[7]</span>. The failure traces directly to the geometry: locally, KL induces Fisher geometry, and Fisher geometry has no concept of spatial distance.</p>
<p>Natural gradient descent <span class="citation" data-cites="amari1998">[8]</span> corrects the optimization geometry by using <img src="https://latex.codecogs.com/png.latex?F(%5Ctheta)%5E%7B-1%7D"> to rescale the gradient, moving in the direction of steepest descent with respect to the Fisher metric rather than the Euclidean one. This is the geometrically correct update in parameter space, but it is still confined to the vertical, statistical geometry. We will come back to it in a future post.</p>
</div>
</div>
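<p>The saturation described in the box is easy to reproduce on a grid. The sketch below is an illustration under simple assumptions: two narrow three-bin bumps on a 64-bin line, with helper names (<code>js_divergence</code>, <code>wasserstein1_1d</code>, <code>bump</code>) invented for this example. It compares the Jensen-Shannon divergence with the one-dimensional Wasserstein-1 distance, computed from the difference of cumulative sums.</p>

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                 # terms with a(x) = 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1_1d(p, q, dx=1.0):
    """W1 on a uniform 1-D grid: integral of the absolute CDF difference."""
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

def bump(center, n=64):
    """A narrow three-bin distribution centred at the given grid index."""
    p = np.zeros(n)
    p[center - 1:center + 2] = [0.25, 0.5, 0.25]
    return p

p = bump(10)
results = []
for shift in (5, 20, 40):
    q = bump(10 + shift)
    results.append((shift, js_divergence(p, q), wasserstein1_1d(p, q)))
    print(shift, results[-1][1], results[-1][2])
```

<p>Once the supports are disjoint, the Jensen-Shannon value stays at <code>log 2</code> no matter how far the bump has moved, while the Wasserstein-1 value equals the shift in grid units. Only the second quantity carries a signal about displacement.</p>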
</section>
<section id="preview-fisher-as-local-kl-geometry" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="preview-fisher-as-local-kl-geometry"><span class="header-section-number">7</span> Preview: Fisher as Local KL Geometry</h2>
<p>The Fisher metric does not stand alone. It is the local, symmetric residue of a richer asymmetric object: the KL divergence. This connection—which we will come back to in another blog post—is worth previewing here because it explains why the Fisher metric appears so naturally in learning.</p>
<div class="math-depth callout callout-style-default callout-tip callout-titled" title="Mathematical Depth">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Mathematical Depth
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><em>Fisher metric as the Hessian of KL.</em></strong> Fix a parametric family <img src="https://latex.codecogs.com/png.latex?p(%5Ccdot%5C,;%5Ctheta)"> and define a local KL function by holding the second argument fixed:</p>
<p><span id="eq-local-kl-function"><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctheta%7D(%5Cdelta)%0A=%0A%5Cmathrm%7BKL%7D%0A%5Cleft(%0Ap(%5Ccdot%5C,;%5Ctheta+%5Cdelta)%0A%5C,%5CBig%5C%7C%5C,%0Ap(%5Ccdot%5C,;%5Ctheta)%0A%5Cright).%0A%5Ctag%7B31%7D"></span></p>
<p>At <img src="https://latex.codecogs.com/png.latex?%5Cdelta=0">, the two distributions agree, so</p>
<p><span id="eq-local-kl-zero"><img src="https://latex.codecogs.com/png.latex?%0AD_%7B%5Ctheta%7D(0)=0.%0A%5Ctag%7B32%7D"></span></p>
<p>The first derivative also vanishes. Intuitively, KL is minimized when the two distributions are the same, so there is no first-order change at <img src="https://latex.codecogs.com/png.latex?%5Cdelta=0">:</p>
<p><span id="eq-local-kl-first-derivative-zero"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft.%0A%5Cfrac%7B%5Cpartial%20D_%7B%5Ctheta%7D(%5Cdelta)%7D%0A%7B%5Cpartial%20%5Cdelta%5Ei%7D%0A%5Cright%7C_%7B%5Cdelta=0%7D%0A=%0A0.%0A%5Ctag%7B33%7D"></span></p>
<p>The second derivative is the Fisher information matrix:</p>
<p><span id="eq-local-kl-hessian-fisher"><img src="https://latex.codecogs.com/png.latex?%0A%5Cleft.%0A%5Cfrac%7B%5Cpartial%5E2%20D_%7B%5Ctheta%7D(%5Cdelta)%7D%0A%7B%5Cpartial%20%5Cdelta%5Ei%20%5Cpartial%20%5Cdelta%5Ej%7D%0A%5Cright%7C_%7B%5Cdelta=0%7D%0A=%0AF_%7Bij%7D(%5Ctheta).%0A%5Ctag%7B34%7D"></span></p>
<p>Therefore, the Taylor expansion begins at second order:</p>
<p><span id="eq-kl-local-fisher-expansion"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathrm%7BKL%7D%0A%5Cleft(%0Ap(%5Ccdot%5C,;%5Ctheta%20+%20%5Cdelta)%0A%5C,%5CBig%5C%7C%5C,%0Ap(%5Ccdot%5C,;%5Ctheta)%0A%5Cright)%0A=%0A%5Cfrac%7B1%7D%7B2%7D%5C,F_%7Bij%7D(%5Ctheta)%5Cdelta%5Ei%20%5Cdelta%5Ej%0A+%20%5Cmathcal%7BO%7D(%5C%7C%5Cdelta%5C%7C%5E3).%0A%5Ctag%7B35%7D"></span></p>
<p>The same second-order term appears if the KL direction is reversed:</p>
<p><span id="eq-kl-local-fisher-expansion-reverse"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathrm%7BKL%7D%0A%5Cleft(%0Ap(%5Ccdot%5C,;%5Ctheta)%0A%5C,%5CBig%5C%7C%5C,%0Ap(%5Ccdot%5C,;%5Ctheta%20+%20%5Cdelta)%0A%5Cright)%0A=%0A%5Cfrac%7B1%7D%7B2%7D%5C,F_%7Bij%7D(%5Ctheta)%5Cdelta%5Ei%20%5Cdelta%5Ej%0A+%20%5Cmathcal%7BO%7D(%5C%7C%5Cdelta%5C%7C%5E3).%0A%5Ctag%7B36%7D"></span></p>
<p>This is why Fisher geometry is the <em>local symmetric part</em> of KL. The asymmetry of KL is real, but it does not show up in the quadratic approximation; it appears only at third order and beyond.</p>
<p>Here <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BO%7D(%5C%7C%5Cdelta%5C%7C%5E3)"> means that the omitted terms are of cubic order or higher in the displacement size <img src="https://latex.codecogs.com/png.latex?%5C%7C%5Cdelta%5C%7C">.</p>
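<p>A quick numerical check of the two expansions, as a sketch rather than a proof: take the univariate Gaussian in <code>(mu, var)</code> coordinates, where the closed-form KL and the standard Fisher matrix <code>diag(1/var, 1/(2 var^2))</code> are both available. The base point and the direction of the displacement are arbitrary choices made for this example.</p>

```python
import numpy as np

def kl_gauss(mu_a, var_a, mu_b, var_b):
    """KL( N(mu_a, var_a) || N(mu_b, var_b) ) in closed form."""
    return 0.5 * (np.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)

mu, var = 0.5, 2.0
# Fisher matrix of N(mu, var) in (mu, var) coordinates.
F = np.array([[1.0 / var, 0.0], [0.0, 1.0 / (2.0 * var ** 2)]])
direction = np.array([1.0, -0.5])    # arbitrary displacement direction

ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    d = eps * direction
    quad = 0.5 * d @ F @ d                                 # (1/2) F_ij d^i d^j
    forward = kl_gauss(mu + d[0], var + d[1], mu, var)     # KL(p_{theta+d} || p_theta)
    reverse = kl_gauss(mu, var, mu + d[0], var + d[1])     # KL(p_theta || p_{theta+d})
    ratios.append((eps, forward / quad, reverse / quad))
    print(eps, forward / quad, reverse / quad)
```

<p>Both ratios approach 1 as the displacement shrinks, and the gap between the forward and reverse columns shrinks with it. That gap is the third-order asymmetry that the quadratic term cannot see.</p>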
</div>
</div>
<p>This means two things. First, any algorithm that minimizes KL divergence inherits Fisher geometry in its local second-order behavior. Minimizing KL gives Fisher as the local geometry, but the actual update rule may still use Euclidean gradients—this is precisely the gap that natural gradient descent, which will be discussed in future posts, is designed to close. Second, the Fisher metric inherits its symmetry from the Hessian operation: even though KL itself is asymmetric (because <img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BKL%7D(p%5C%7Cq)%20%5Cneq%20%5Cmathrm%7BKL%7D(q%5C%7Cp)"> in general), its second-order term is a symmetric bilinear form—the metric tensor <img src="https://latex.codecogs.com/png.latex?F_%7Bij%7D">.</p>
<p>The asymmetry of KL beyond second order encodes something important: the two directions of KL correspond to qualitatively different behaviors (mode-covering vs.&nbsp;mode-seeking). In a later post, I will examine this asymmetry in detail and show how it points directly toward the <em>Schrödinger bridge</em>.</p>
<div class="consequence-box callout callout-style-default callout-important callout-titled" title="Consequence for Learning">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Consequence for Learning
</div>
</div>
<div class="callout-body-container callout-body">
<p>Many learning objectives compare distributions through KL divergence or likelihood. Locally, these objectives inherit the Fisher metric: directions in parameter space are not equally meaningful, because some directions change the model distribution much more than others.</p>
<p>Ordinary gradient descent ignores this geometry. It treats parameter space as Euclidean, so a step of the same coordinate size is treated as equally important in every direction. Natural gradient descent corrects this by using <img src="https://latex.codecogs.com/png.latex?F(%5Ctheta)%5E%7B-1%7D%5Cnabla%20L">, the steepest descent direction measured in distribution space rather than raw coordinates.</p>
<p>For modern neural networks, however, the full FIM is generally impractical to form explicitly: with <img src="https://latex.codecogs.com/png.latex?N"> parameters, it is an <img src="https://latex.codecogs.com/png.latex?N%5Ctimes%20N"> matrix, so storage alone grows quadratically in the number of parameters. Practical methods therefore use approximations such as empirical Fisher estimates, matrix-free Fisher-vector products, diagonal approximations, low-rank approximations, or structured approximations such as K-FAC. These approximations are an active research area: they change the computation, but not the underlying geometric object.</p>
<p>This is the first practical lesson of Fisher geometry: learning is not just about the slope of the loss, but about the geometry of how parameters change distributions.</p>
</div>
</div>
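<p>To make the natural-gradient update concrete, here is a minimal sketch for the smallest possible model: fitting a univariate Gaussian by maximum likelihood in <code>(mu, var)</code> coordinates. The synthetic data, the step size, and the iteration count are assumptions of the example. The Fisher matrix is small enough to form exactly, so none of the approximations listed above are needed.</p>

```python
import numpy as np

def nll_grad(mu, var, x):
    """Euclidean gradient of the average Gaussian negative log-likelihood."""
    s = np.mean((x - mu) ** 2)
    g_mu = -(np.mean(x) - mu) / var
    g_var = 0.5 / var - 0.5 * s / var ** 2
    return np.array([g_mu, g_var])

def fisher(mu, var):
    """Fisher matrix of N(mu, var) in (mu, var) coordinates."""
    return np.array([[1.0 / var, 0.0], [0.0, 0.5 / var ** 2]])

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=10.0, size=5000)   # synthetic data (assumption)

params = np.array([0.0, 1.0])                    # initial (mu, var)
lr = 0.5                                         # illustrative step size
for _ in range(50):
    g = nll_grad(params[0], params[1], x)
    # Natural gradient: solve F v = g instead of forming the inverse of F.
    natural_step = np.linalg.solve(fisher(params[0], params[1]), g)
    params = params - lr * natural_step

print(params, x.mean(), x.var())
```

<p>In this model the natural step works out to <code>(mu - mean(x), var - mean((x - mu)^2))</code>, so the update does not depend on the scale of the data. The Euclidean gradient of the mean, by contrast, is divided by <code>var</code> and becomes tiny when the variance is large.</p>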
</section>
<section id="the-fisher-metric-in-the-exponential-family" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="the-fisher-metric-in-the-exponential-family"><span class="header-section-number">8</span> The Fisher Metric in the Exponential Family</h2>
<p>The previous section showed that KL locally contains Fisher geometry in its second-order term. Exponential families are the cleanest setting where this local picture becomes algebraic: the Fisher metric, KL divergence, and convex duality can all be written in terms of a single function, the log-partition function.</p>
<p>The Fisher metric takes a particularly clean form for exponential families, which include the Gaussian, Bernoulli, and Poisson distributions along with many others used in practice.</p>
<p>An exponential family has density</p>
<p><span id="eq-exponential-family-density"><img src="https://latex.codecogs.com/png.latex?%0Ap(x;%5Ctheta)%0A=%0Ah(x)%5Cexp%5C,%5C%7B%5Ctheta%5E%7B%5Ctop%7D%5C,T(x)-A(%5Ctheta)%5C%7D%0A%5Ctag%7B37%7D"></span></p>
<p>where <img src="https://latex.codecogs.com/png.latex?T(x)"> are the sufficient statistics, <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> are the natural parameters, and <img src="https://latex.codecogs.com/png.latex?A(%5Ctheta)"> is the log-partition function. A direct calculation shows that the FIM is the Hessian of <img src="https://latex.codecogs.com/png.latex?A">:</p>
<p><span id="eq-fisher-hessian-log-partition"><img src="https://latex.codecogs.com/png.latex?%0AF_%7Bij%7D%0A=%0A%5Cfrac%7B%5Cpartial%5E2%5C,A%7D%7B%5Cpartial%5Ctheta%5Ei%5C,%5Cpartial%5Ctheta%5Ej%7D.%0A%5Ctag%7B38%7D"></span></p>
<div class="math-depth callout callout-style-default callout-tip callout-titled" title="Mathematical Depth">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Mathematical Depth
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong><em>Derivation.</em></strong> For the exponential family, the log-density is</p>
<p><span id="eq-exponential-family-log-density"><img src="https://latex.codecogs.com/png.latex?%0A%5Clog%20p(x;%5Ctheta)%0A=%0A%5Ctheta%5E%7B%5Ctop%7DT(x)-A(%5Ctheta)+%5Clog%20h(x).%0A%5Ctag%7B39%7D"></span></p>
<p>Differentiating with respect to the natural parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%5Ei"> gives the score component</p>
<p><span id="eq-exponential-family-score-component"><img src="https://latex.codecogs.com/png.latex?%0A%5Cpartial_i%20%5Clog%20p(x;%5Ctheta)%0A=%0AT_i(x)-%5Cpartial_i%20A(%5Ctheta).%0A%5Ctag%7B40%7D"></span></p>
<p>The score has mean zero by Fisher’s identity:</p>
<p><span id="eq-exponential-family-score-mean-zero"><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%5Cleft%5B%5Cpartial_i%20%5Clog%20p(x;%5Ctheta)%5Cright%5D%0A=%0A0.%0A%5Ctag%7B41%7D"></span></p>
<p>Therefore,</p>
<p><span id="eq-exponential-family-fisher-covariance"><img src="https://latex.codecogs.com/png.latex?%0AF_%7Bij%7D%0A=%0A%5Cmathbb%7BE%7D_%7B%5Ctheta%7D%0A%5Cleft%5B%0A%20%20%5Cleft(T_i-%5Cpartial_i%20A(%5Ctheta)%5Cright)%0A%20%20%5Cleft(T_j-%5Cpartial_j%20A(%5Ctheta)%5Cright)%0A%5Cright%5D%0A=%0A%5Coperatorname%7BCov%7D_%7B%5Ctheta%7D(T_i,T_j)%0A=%0A%5Cpartial_i%5Cpartial_j%20A(%5Ctheta)%0A%5Ctag%7B42%7D"></span></p>
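<p>The FIM is the covariance of the sufficient statistics, or equivalently, the Hessian of the log-partition function <em>with respect to the natural parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta"></em>. Since <img src="https://latex.codecogs.com/png.latex?A"> is convex, <img src="https://latex.codecogs.com/png.latex?F"> is positive semidefinite. When the family is minimal, meaning the sufficient statistics satisfy no affine dependence, <img src="https://latex.codecogs.com/png.latex?A"> is strictly convex and <img src="https://latex.codecogs.com/png.latex?F"> is positive definite, which is what a valid Riemannian metric requires.</p>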
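<p>As a sanity check on this identity, consider the Bernoulli family, where the sufficient statistic is the outcome itself and the log-partition function is <code>log(1 + exp(theta))</code>. The sketch below compares a finite-difference second derivative of the log-partition function with the exact variance of the sufficient statistic; the parameter value and the step <code>h</code> are arbitrary choices for illustration.</p>

```python
import numpy as np

def log_partition(theta):
    """Bernoulli log-partition function A(theta) = log(1 + exp(theta))."""
    return np.log1p(np.exp(theta))

theta = 0.8
h = 1e-4   # finite-difference step (illustrative)

# Central second difference approximating A''(theta).
hessian_A = (
    log_partition(theta + h) - 2.0 * log_partition(theta) + log_partition(theta - h)
) / h ** 2

# Exact variance of T(x) = x under p(x; theta), summing over x in {0, 1}.
p1 = np.exp(theta - log_partition(theta))        # P(x = 1)
var_T = p1 * (1.0 - p1) ** 2 + (1.0 - p1) * (0.0 - p1) ** 2

print(hessian_A, var_T)
```

<p>The two numbers agree up to finite-difference error, which is the one-dimensional case of the covariance identity above.</p>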
<p><strong><em>Example: Gaussians.</em></strong> The Gaussian <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BN%7D(%5Cmu,%5Csigma%5E2)"> belongs to an exponential family with natural parameters</p>
<p><span id="eq-gaussian-natural-parameters"><img src="https://latex.codecogs.com/png.latex?%0A%5Ctheta_1=%5Cfrac%7B%5Cmu%7D%7B%5Csigma%5E2%7D,%0A%5Cqquad%0A%5Ctheta_2=-%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D.%0A%5Ctag%7B43%7D"></span></p>
<p>In those coordinates,</p>
<p><span id="eq-gaussian-natural-fisher-hessian"><img src="https://latex.codecogs.com/png.latex?%0AF_%7B%5Ctheta%7D%0A=%0A%5Cnabla%5E2%20A(%5Ctheta).%0A%5Ctag%7B44%7D"></span></p>
<p>To express the same metric in the more familiar coordinates <img src="https://latex.codecogs.com/png.latex?(%5Cmu,%5Csigma%5E2)">, use the tensor transformation rule. If <img src="https://latex.codecogs.com/png.latex?%5Calpha=(%5Cmu,%5Csigma%5E2)"> denotes the new coordinates, then</p>
<p><span id="eq-fisher-coordinate-transform"><img src="https://latex.codecogs.com/png.latex?%0AF%5E%7B(%5Calpha)%7D_%7Bab%7D%0A=%0A%5Csum_%7Bi,j%7D%0AF%5E%7B(%5Ctheta)%7D_%7Bij%7D%0A%5Cfrac%7B%5Cpartial%20%5Ctheta%5Ei%7D%7B%5Cpartial%20%5Calpha%5Ea%7D%0A%5Cfrac%7B%5Cpartial%20%5Ctheta%5Ej%7D%7B%5Cpartial%20%5Calpha%5Eb%7D.%0A%5Ctag%7B45%7D"></span></p>
<p>Applying this transformation gives</p>
<p><span id="eq-gaussian-fisher-metric"><img src="https://latex.codecogs.com/png.latex?%0AF(%5Cmu,%5Csigma%5E2)%0A=%0A%5Cbegin%7Bpmatrix%7D%0A1/%5Csigma%5E2%20&amp;%200%20%5C%5C%0A0%20&amp;%201/(2%5Csigma%5E4)%0A%5Cend%7Bpmatrix%7D%0A%5Ctag%7B46%7D"></span></p>
<p>This is the correct Fisher metric in <img src="https://latex.codecogs.com/png.latex?(%5Cmu,%5Csigma%5E2)"> coordinates, but it is <em>not</em> obtained as the Hessian of <img src="https://latex.codecogs.com/png.latex?A"> with respect to <img src="https://latex.codecogs.com/png.latex?(%5Cmu,%5Csigma%5E2)"> directly, because those are not natural parameters. Geodesic distance in the Fisher metric on the space of univariate Gaussians has a closed-form expression and is well studied <span class="citation" data-cites="amari2016">[4]</span>.</p>
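<p>The coordinate change can be verified directly. The sketch below assumes the standard Gaussian log-partition function <code>A(theta) = -theta1^2 / (4 theta2) - 0.5 log(-2 theta2)</code>, which is not written out above, and uses its analytic Hessian together with the Jacobian of the map from <code>(mu, var)</code> to the natural parameters. The numerical values of <code>mu</code> and <code>var</code> are arbitrary.</p>

```python
import numpy as np

mu, var = 1.5, 0.7
theta1, theta2 = mu / var, -1.0 / (2.0 * var)    # natural parameters

# Analytic Hessian of A(theta) = -theta1^2 / (4 theta2) - 0.5 * log(-2 theta2).
F_theta = np.array([
    [-1.0 / (2.0 * theta2), theta1 / (2.0 * theta2 ** 2)],
    [theta1 / (2.0 * theta2 ** 2),
     -theta1 ** 2 / (2.0 * theta2 ** 3) + 1.0 / (2.0 * theta2 ** 2)],
])

# Jacobian J[i, a] = d theta^i / d alpha^a with alpha = (mu, var).
J = np.array([
    [1.0 / var, -mu / var ** 2],
    [0.0, 1.0 / (2.0 * var ** 2)],
])

# Tensor transformation rule: F_alpha = J^T F_theta J.
F_alpha = J.T @ F_theta @ J
F_expected = np.diag([1.0 / var, 1.0 / (2.0 * var ** 2)])
print(np.allclose(F_alpha, F_expected))
```

<p>The off-diagonal entries cancel only after the Jacobian is applied. In natural coordinates the metric is not diagonal, since the Hessian there is the covariance of <code>x</code> and <code>x^2</code>.</p>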
<p><strong><em>Signpost: dual coordinates.</em></strong> The fact that the Fisher metric on an exponential family is the Hessian of a convex function is deeply connected to the duality structure of information geometry <span class="citation" data-cites="amari2000">[3]</span>. The convex function <img src="https://latex.codecogs.com/png.latex?A"> generates the natural-parameter coordinates <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. Its gradient generates the expectation-parameter coordinates</p>
<p><span id="eq-expectation-parameters"><img src="https://latex.codecogs.com/png.latex?%0A%5Ceta%0A=%0A%5Cnabla%20A(%5Ctheta).%0A%5Ctag%7B47%7D"></span></p>
<p>The Legendre transform of <img src="https://latex.codecogs.com/png.latex?A"> then exchanges these two coordinate systems:</p>
<p><span id="eq-legendre-transform-log-partition"><img src="https://latex.codecogs.com/png.latex?%0AA%5E%7B*%7D(%5Ceta)%0A=%0A%5Csup_%7B%5Ctheta%7D%0A%5Cleft%5C%7B%0A%5Ctheta%5E%7B%5Ctop%7D%5Ceta%20-%20A(%5Ctheta)%0A%5Cright%5C%7D.%0A%5Ctag%7B48%7D"></span></p>
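<p>In the same exponential-family setting, KL divergence between two members of the family is the Bregman divergence generated by <img src="https://latex.codecogs.com/png.latex?A">. With the convention below, <img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BKL%7D(p(%5Ccdot%5C,;%5Ctheta)%5C%7Cp(%5Ccdot%5C,;%5Ctheta'))"> equals <img src="https://latex.codecogs.com/png.latex?D_A(%5Ctheta',%5Ctheta)">, so the arguments appear in swapped order:</p>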
<p><span id="eq-bregman-divergence-log-partition"><img src="https://latex.codecogs.com/png.latex?%0AD_A(%5Ctheta',%5Ctheta)%0A=%0AA(%5Ctheta')%0A-A(%5Ctheta)%0A-%5Cnabla%20A(%5Ctheta)%5E%7B%5Ctop%7D(%5Ctheta'-%5Ctheta).%0A%5Ctag%7B49%7D"></span></p>
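<p>The Bregman identity can be checked on the Poisson family, where the natural parameter is the log of the rate and the log-partition function is <code>exp(theta)</code>. This is a sketch under those assumptions: the two rates and the truncation of the sum at 200 terms are illustrative choices. It compares a direct summation of KL from the first distribution to the second against the Bregman expression with its arguments in the order (new parameter, old parameter).</p>

```python
import math
import numpy as np

def A(theta):
    """Poisson log-partition function, with theta = log(rate)."""
    return np.exp(theta)

def bregman(theta_new, theta_old):
    """D_A(theta', theta) = A(theta') - A(theta) - grad A(theta) * (theta' - theta)."""
    return A(theta_new) - A(theta_old) - np.exp(theta_old) * (theta_new - theta_old)

def poisson_log_pmf(k, rate):
    return k * math.log(rate) - rate - math.lgamma(k + 1)

rate, rate_new = 3.0, 4.5

# Direct KL( Poisson(rate) || Poisson(rate_new) ), truncated at k = 199.
kl_direct = sum(
    math.exp(poisson_log_pmf(k, rate))
    * (poisson_log_pmf(k, rate) - poisson_log_pmf(k, rate_new))
    for k in range(200)
)

kl_bregman = bregman(math.log(rate_new), math.log(rate))
print(kl_direct, kl_bregman)
```

<p>The two values agree to numerical precision. Swapping the arguments of <code>bregman</code> gives the KL in the opposite direction, which is a different number.</p>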
<p>This is only a signpost for now. The important point for this post is that Fisher geometry is not isolated: it sits inside a larger convex-dual structure that will return when we discuss KL divergence, natural gradients, and learning dynamics.</p>
</div>
</div>
</section>
<section id="a-first-uniqueness-result" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="a-first-uniqueness-result"><span class="header-section-number">9</span> A First Uniqueness Result</h2>
<p>I close this post by returning to Chentsov’s theorem, because it gives Fisher geometry a special status.</p>
<p>The Fisher metric is not just a convenient way to measure local distinguishability. Given one natural requirement—that the geometry of a statistical model should not depend on how the data are represented, as long as no statistical information is lost—the Fisher-Rao metric is forced. Up to a constant scale factor, it is the only Riemannian metric with this invariance property.</p>
<p>That is the main lesson of this first ruler. Fisher information does not merely assign numbers to estimators or appear in asymptotic statistics. It defines the unique local geometry compatible with statistical distinguishability. If we care about how distributions change as models learn, this is the geometry that the model carries.</p>
<p>But this uniqueness also explains Fisher’s limitation. Because it is invariant under sufficient transformations of the sample space, Fisher geometry is blind to the spatial arrangement of the sample space itself. It knows how probability weights change; it does not know how far mass has moved.</p>
<p>This is exactly where the second ruler enters. The next post turns to Wasserstein geometry, where distance is not measured by distinguishability but by displacement. Brenier’s theorem <span class="citation" data-cites="brenier1991">[9]</span> will give a second kind of uniqueness: under quadratic transport cost, the optimal way to move mass is not arbitrary, but generated by the gradient of a convex potential.</p>
</section>
<section id="what-comes-next" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="what-comes-next"><span class="header-section-number">10</span> What Comes Next</h2>
<p>This post introduced the first ruler: Fisher information, the local geometry of statistical distinguishability. The next post introduces the second ruler: Wasserstein distance, the global geometry of displacement.</p>
<p>Together, these two rulers set up the main tension of the series. Fisher asks how distributions differ statistically; Wasserstein asks how probability mass moves spatially. Learning needs both perspectives.</p>
<p>From there, the story becomes more interesting. KL divergence will appear as the asymmetric object underlying Fisher geometry. Schrödinger bridges will connect entropy and transport into a single stochastic path. Hodge decomposition will reveal the algebraic anatomy of optimal flows. Natural gradient will show what it means for optimization to respect statistical geometry. Langevin dynamics will then add calibrated noise, turning gradient descent from a mode-seeking optimizer into a sampler over distributions.</p>
<p>Finally, these ideas will reappear in modern machine learning, where diffusion models, flow matching, natural gradients, Langevin samplers, and Bayesian methods can all be understood as approximations to pieces of this geometry.</p>
<p>The goal of this series is not to argue that these subjects are secretly the same theory. They are not. Rather, the claim is that they discovered compatible pieces of the same geometric picture: information provides the local ruler, transport provides the spatial ruler, KL provides direction, entropy provides stochastic paths, Langevin connects optimization to sampling, and topology explains the obstructions.</p>
<div class="post-end">
<p>༺ The End ༻</p>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body" data-entry-spacing="0">
<div id="ref-fisher1925" class="csl-entry">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">R. A. Fisher. <span>“Theory of statistical estimation.”</span> <em>Mathematical Proceedings of the Cambridge Philosophical Society</em>, <strong>22</strong>(5), pp. 700–725, 1925.</div>
</div>
<div id="ref-rao1945" class="csl-entry">
<div class="csl-left-margin">[2] </div><div class="csl-right-inline">C. R. Rao. <span>“Information and the accuracy attainable in the estimation of statistical parameters.”</span> <em>Bulletin of the Calcutta Mathematical Society</em>, <strong>37</strong>, pp. 81–91, 1945.</div>
</div>
<div id="ref-amari2000" class="csl-entry">
<div class="csl-left-margin">[3] </div><div class="csl-right-inline">S. Amari and H. Nagaoka. <em>Methods of information geometry</em>. American Mathematical Society, 2000.</div>
</div>
<div id="ref-amari2016" class="csl-entry">
<div class="csl-left-margin">[4] </div><div class="csl-right-inline">S. Amari. <em>Information geometry and its applications</em>. Springer, 2016.</div>
</div>
<div id="ref-chentsov1982" class="csl-entry">
<div class="csl-left-margin">[5] </div><div class="csl-right-inline">N. N. Chentsov. <em>Statistical decision rules and optimal inference</em>. American Mathematical Society, 1982.</div>
</div>
<div id="ref-wainwright2008" class="csl-entry">
<div class="csl-left-margin">[6] </div><div class="csl-right-inline">M. J. Wainwright and M. I. Jordan. <span>“Graphical models, exponential families, and variational inference.”</span> <em>Foundations and Trends in Machine Learning</em>, <strong>1</strong>(1–2), pp. 1–305, 2008.</div>
</div>
<div id="ref-arjovsky2017" class="csl-entry">
<div class="csl-left-margin">[7] </div><div class="csl-right-inline">M. Arjovsky, S. Chintala, and L. Bottou. <span>“<span>W</span>asserstein generative adversarial networks.”</span> <em>Proceedings of the 34th International Conference on Machine Learning</em>, 2017, pp. 214–223.</div>
</div>
<div id="ref-amari1998" class="csl-entry">
<div class="csl-left-margin">[8] </div><div class="csl-right-inline">S. Amari. <span>“Natural gradient works efficiently in learning.”</span> <em>Neural Computation</em>, <strong>10</strong>(2), pp. 251–276, 1998.</div>
</div>
<div id="ref-brenier1991" class="csl-entry">
<div class="csl-left-margin">[9] </div><div class="csl-right-inline">Y. Brenier. <span>“Polar factorization and monotone rearrangement of vector-valued functions.”</span> <em>Communications on Pure and Applied Mathematics</em>, <strong>44</strong>(4), pp. 375–417, 1991.</div>
</div>
</div></section></div> ]]></description>
  <category>information geometry</category>
  <category>fisher information</category>
  <category>pullback metric</category>
  <category>riemannian metric</category>
  <guid>https://grvkhnl.github.io/posts/the-first-ruler-of-distinguishability/</guid>
  <pubDate>Fri, 05 Jun 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
