Fano’s inequality
Consider three random variables forming a Markov chain,
\[ X \mapsto Y \mapsto \widehat{X} \tag{1}\]
in the sense that \(Y = \textrm{function}(X, \text{noise})\) and \(\widehat{X} = \textrm{function}(Y, \text{noise})\). Typical situations include:
- We select a parameter \(\theta\) for a probabilistic model \(\mathop{\mathrm{\mathbb{P}}}_{\theta}\). Afterward, we collect data \(X\) from this model, and our goal is to estimate the parameter \(\theta\) solely from the data \(X\) (a minimal simulation of this chain is sketched after the list).
- We generate data \(X\), compress this data into \(X_{\text{zip}}\), and then attempt to recover the original data \(X\) as closely as we can.
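To make the first situation concrete, here is a minimal simulation sketch (the parameters, variable names, and noise model below are my own illustrative assumptions, not from the text): a binary parameter plays the role of the first node in Equation 1, a handful of noisy samples play the middle node, and a majority-vote estimate plays the last node.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not from the text).
n_trials, n_samples, flip_prob = 10_000, 5, 0.2

# theta -> samples -> theta_hat: the estimator only ever sees the noisy samples.
theta = rng.integers(0, 2, size=n_trials)                  # parameter to recover
flips = rng.random((n_trials, n_samples)) < flip_prob      # channel noise
samples = theta[:, None] ^ flips.astype(int)               # noisy observations of theta
theta_hat = (samples.mean(axis=1) > 0.5).astype(int)       # majority-vote estimate

p_error = np.mean(theta_hat != theta)
print(f"empirical error probability p_E ≈ {p_error:.3f}")
```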
Since each step in Equation 1 destroys some information (this is the data-processing inequality), it is important to measure how accurately \(\widehat{X}\) estimates the initial input \(X\). In other words, we want to know how much more information (expressed in ‘bits’) we need to reconstruct \(X\) from knowledge of \(\widehat{X}\) alone, i.e. we would like to upper-bound \(H(X \, | \widehat{X})\). For this purpose, imagine an “error” variable \(E\) that indicates whether \(\widehat{X}\) perfectly matches \(X\),
\[ E = \mathbf{1} {\left( \widehat{X} \neq X \right)} . \]
The probability of error is \(p_E = \mathop{\mathrm{\mathbb{P}}}(\widehat{X} \neq X)\), so \(E \sim \text{Bern}(p_E)\). To estimate \(X\) from \(\widehat{X}\), we can start by learning whether \(\widehat{X}\) equals \(X\), which costs us \(H(E | \widehat{X}) \leq H(E) = h_2(p_E)\) ‘bits’ of information. If it turns out that \(E=0\), we are done asking. If we find that \(E=1\), however, we need to ask an additional \(H(X | \widehat{X}, E=1)\) bits’ worth of questions. Crucially, \(H(X | \widehat{X}, E) \leq H(X)\), but the sharper bound is \(H(X | \widehat{X}, E=1) \leq \log_2(|\mathcal{X}|-1)\), since when \(E=1\) the variable \(X\) can take any value in \(\mathcal{X}\) except \(\widehat{X}\); and this case only arises with probability \(p_E\), while the case \(E=0\) contributes nothing. Writing this reasoning quantitatively gives Fano’s inequality:
\[ H(X | \widehat{X}) \leq h_2(p_E) + p_E \, \log_2(|\mathcal{X}|-1). \tag{2}\]
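For completeness, the counting argument above can be written out with the chain rule (standard steps, filled in here rather than taken verbatim from the text):

\[
\begin{aligned}
H(X | \widehat{X}) &= H(X, E \,|\, \widehat{X}) = H(E \,|\, \widehat{X}) + H(X \,|\, \widehat{X}, E) \\
&\leq h_2(p_E) + (1 - p_E) \underbrace{H(X \,|\, \widehat{X}, E=0)}_{=\,0} + p_E \, H(X \,|\, \widehat{X}, E=1) \\
&\leq h_2(p_E) + p_E \, \log_2(|\mathcal{X}|-1),
\end{aligned}
\]

where the first equality holds because \(E\) is a function of \((X, \widehat{X})\).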
Apparently, this inequality was first derived by Robert Fano in the 50s while teaching a Ph.D. seminar at MIT. In words: a large \(H(X | \widehat{X})\) means that \(\widehat{X}\) offers insufficient information about \(X\), and as a result, the probability of error \(p_E\) must be high.
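As a quick numerical sanity check (a sketch added here, not from the original), the following script draws random joint distributions over \((X, \widehat{X})\) and confirms that \(H(X | \widehat{X})\) never exceeds the right-hand side of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy in bits of a probability vector (0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n_states, n_trials = 6, 1_000
for _ in range(n_trials):
    # Random joint distribution P(X = i, X_hat = j) over an alphabet of size n_states.
    joint = rng.random((n_states, n_states))
    joint /= joint.sum()

    # Conditional entropy H(X | X_hat) = H(X, X_hat) - H(X_hat).
    h_cond = entropy(joint.ravel()) - entropy(joint.sum(axis=0))

    # Error probability p_E = P(X != X_hat) and the right-hand side of Equation 2.
    p_e = 1.0 - np.trace(joint)
    rhs = entropy(np.array([p_e, 1.0 - p_e])) + p_e * np.log2(n_states - 1)

    assert h_cond <= rhs + 1e-12

print("Fano's bound held in every random trial")
```

Note that the bound holds for any joint distribution of \(X\) and \(\widehat{X}\), which is why no Markov structure needs to be simulated here.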
Applications:
- Converse of Shannon’s channel coding theorem (sketched below)
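A brief sketch of how Fano’s inequality enters that converse (a standard textbook argument, paraphrased here rather than taken from the original): for a message \(W\) drawn uniformly from \(\{1, \dots, 2^{nR}\}\), transmitted over \(n\) uses of a channel with capacity \(C\), and decoded as \(\widehat{W}\),

\[
nR = H(W) = H(W \,|\, \widehat{W}) + I(W; \widehat{W}) \leq \bigl(1 + p_E \, nR\bigr) + nC,
\]

where the first term uses Fano’s inequality (Equation 2) with \(|\mathcal{X}| = 2^{nR}\) and the second uses the data-processing inequality. Dividing by \(nR\) shows that if \(R > C\), the error probability \(p_E\) stays bounded away from zero as \(n \to \infty\).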