Shannon Entropy & Compression
If Alice chooses a number uniformly at random from the set $\{1, \ldots, N\}$, Bob can use a simple “dichotomy” strategy to ask Alice about $\log_2(N)$ binary Yes/No questions and correctly identify the number. This result can be generalized to non-uniform distributions (cf. Huffman codes, and also the Kraft-McMillan inequality). If Alice chooses a number $k$ from $\{1, \ldots, N\}$ with probabilities $p_1, \ldots, p_N$, Bob can design a deterministic strategy to find the answer using, on average, about

$$H(p) \;=\; \sum_{k=1}^{N} p_k \, \log_2\!\left(\frac{1}{p_k}\right) \tag{1}$$

binary questions, i.e. $H(p)$ bits. To be more precise, there are strategies that require at most $H(p) + 1$ questions on average, and none that can require less than $H(p)$. Note that applying this remark to an iid sequence $X_1, \ldots, X_n \sim p$ and using the fact that $H(p^{\otimes n}) = n \, H(p)$, this shows that one can exactly determine the sequence with at most $n \, H(p) + 1$ binary questions on average. The quantity $H(p)$ defined in Equation 1 is known as the Shannon entropy of the distribution $p$. The same reasoning also implies that there are strategies that can encode each integer $k$ as a binary string of length about $\log_2(1/p_k)$ (i.e. with $\lceil \log_2(1/p_k) \rceil$ bits), with the expected length approximately equal to $H(p)$. This is because a sequence of binary questions can be thought of as a binary tree, etc…
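As a quick numerical sanity check of these claims, here is a minimal sketch (the random distribution below is an arbitrary choice): codeword lengths $\lceil \log_2(1/p_k) \rceil$ satisfy the Kraft inequality, and the resulting expected length lies between $H(p)$ and $H(p) + 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))            # an arbitrary random distribution on {1, ..., 10}

H = -(p * np.log2(p)).sum()               # Shannon entropy of Equation 1
lengths = np.ceil(np.log2(1.0 / p))       # codeword length assigned to the integer k

print("Kraft sum (should be <= 1):", (2.0 ** (-lengths)).sum())
print("entropy H(p)              :", H)
print("expected codeword length  :", (p * lengths).sum())   # between H(p) and H(p) + 1
```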
This remark can be used for compression. Imagine a very long sequence $X_1, \ldots, X_n$ of iid samples from $p$. Encoding each sample $X_i = k$ with about $\log_2(1/p_k)$ bits, one should be able to encode the resulting sequence with about

$$n \, H(p)$$

bits. Can the usual zip compression algorithm do this? To test this, choose a probability distribution $p$ on $\{1, \ldots, N\}$, generate an iid sequence of length $n$, compress this using the zip command, and finally look at the size of the resulting file (in bits). I have done that a few times, with a few different values of $N$, a few random distributions $p$ on $\{1, \ldots, N\}$, and a fixed sequence length $n$. The plot of the size of the compressed files versus the Shannon entropy $n \, H(p)$ looks as below:
Seems like the zip-algorithm works almost optimally for compressing iid sequences.
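For reference, a minimal version of this experiment can be written along the following lines; it uses Python's zlib module, i.e. the DEFLATE algorithm that underlies zip, and the alphabet size, sequence length and distribution below are arbitrary choices rather than the ones used for the plot above.

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)
N, n = 16, 200_000                         # alphabet size and sequence length (arbitrary)
p = rng.dirichlet(np.ones(N))              # a random distribution on {0, ..., N-1}
H = -(p * np.log2(p)).sum()                # Shannon entropy of p, in bits

sample = rng.choice(N, size=n, p=p).astype(np.uint8)      # iid sequence, one byte per symbol
compressed = zlib.compress(sample.tobytes(), level=9)     # DEFLATE, as used by zip

print("Shannon bound n * H(p) (bits):", n * H)
print("compressed size        (bits):", 8 * len(compressed))
```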
Sequence of random variables
Now consider a pair of discrete random variables $(X, Y)$. If Alice draws a sample from this pair of rvs, one can ask about $H(X, Y)$ binary questions on average to exactly find out these two values. To do that, one can ask about $H(X)$ questions to estimate $X$, and once $X$ is estimated, one can then ask about $H(Y \mid X)$ questions to estimate $Y$. This strategy requires on average $H(X) + H(Y \mid X)$ binary questions and is actually optimal, showing that

$$H(X, Y) \;=\; H(X) + H(Y \mid X), \tag{2}$$

where we have defined $H(Y \mid X) = \sum_{x} \mathbb{P}(X = x) \, H(Y \mid X = x)$.
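Equation 2 is easy to check numerically; here is a minimal sketch with an arbitrary random joint distribution.

```python
import numpy as np

def entropy(q):
    """Shannon entropy (in bits) of a discrete distribution q."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

rng = np.random.default_rng(0)
p_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)    # arbitrary joint distribution of (X, Y)
p_x = p_xy.sum(axis=1)                             # marginal of X

# H(Y | X) = sum_x P(X = x) * H(Y | X = x)
H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(p_xy.shape[0]))

print("H(X, Y)        :", entropy(p_xy.ravel()))
print("H(X) + H(Y | X):", entropy(p_x) + H_y_given_x)
```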
Indeed, one can generalize these concepts to more than two random variables. Iterating Equation 2 shows that the trajectory $X_1, \ldots, X_n$ of a stationary ergodic Markov chain can be estimated on average with approximately $n \, H$ binary questions, where

$$H \;=\; \sum_{x} \pi(x) \sum_{y} p(x, y) \, \log_2\!\left(\frac{1}{p(x, y)}\right).$$

Here, $\pi$ is the equilibrium distribution of the Markov chain and $p(x, y) = \mathbb{P}(X_{k+1} = y \mid X_k = x)$ are the transition probabilities.
Can zip compress Markovian trajectories and roughly achieve that level of compression? Indeed, one can test this by generating random Markov transition matrices (which are, with very high probability, ergodic with respect to some equilibrium distribution $\pi$). Doing this with trajectories of length $n$ (i.e. quite short, because generating longer trajectories is quite slow) on $\{1, \ldots, N\}$ for a few values of $N$, one gets the following results:
In red is the entropy estimated without using the Markovian structure, i.e. assuming that the $X_k$ are iid samples. One can clearly see that the dependencies between successive samples are exploited for better compression. Nevertheless, the optimal compression rate given by the Shannon entropy is not reached. Indeed, zip is not an optimal algorithm – it cannot even compress a perfectly predictable sequence as well as one would hope!
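For completeness, here is a sketch of the Markovian experiment, again with Python's zlib standing in for zip and with arbitrary choices of state-space size and trajectory length; the quantities $\pi$ and $H$ are the equilibrium distribution and entropy rate defined above.

```python
import numpy as np
import zlib

rng = np.random.default_rng(1)
N, n = 8, 100_000                          # state-space size and trajectory length (arbitrary)

P = rng.dirichlet(np.ones(N), size=N)      # random transition matrix: P[x, y] = p(x, y)

# Equilibrium distribution pi: left eigenvector of P associated with the eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

# Entropy rate H = sum_{x, y} pi(x) p(x, y) log2(1 / p(x, y))   (entries of P are > 0 a.s.)
H = float(-(pi[:, None] * P * np.log2(P)).sum())

# Simulate a stationary trajectory and compress it with DEFLATE (the algorithm behind zip)
traj = np.empty(n, dtype=np.uint8)
traj[0] = rng.choice(N, p=pi)
for k in range(1, n):
    traj[k] = rng.choice(N, p=P[traj[k - 1]])
compressed = zlib.compress(traj.tobytes(), level=9)

print("entropy-rate bound n * H (bits):", n * H)
print("compressed size          (bits):", 8 * len(compressed))
```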
Asymptotic Equipartition Property (AEP)
The AEP is a simple remark that gives a convenient way of reasoning about long sequences of random variables $X_1, \ldots, X_n$ with $n \gg 1$. For example, assuming that the random variables are independent and identically distributed as the random variable $X$, the law of large numbers (LLN) gives that

$$-\frac{1}{n} \, \log_2 \mathbb{P}(X_1, \ldots, X_n) \;=\; \frac{1}{n} \sum_{k=1}^{n} \log_2\!\left(\frac{1}{p(X_k)}\right) \;\xrightarrow[n \to \infty]{}\; H(X).$$

This means that any “typical” sequence has a probability of about $2^{-n \, H(X)}$ of occurring, which also means that there are about $2^{n \, H(X)}$ such “typical” sequences. Indeed, one could use the LLN for Markov chains and obtain a similar statement for Markovian sequences. All this can be made rigorous with large deviation theory, for example. This also establishes the link with the “statistical physics” definition of entropy as the logarithm of the number of configurations… Sets of the type

$$A_{\varepsilon}^{(n)} \;=\; \left\{ (x_1, \ldots, x_n) \;:\; \left| -\frac{1}{n} \, \log_2 \mathbb{P}(x_1, \ldots, x_n) - H(X) \right| \;\le\; \varepsilon \right\}$$

are usually called typical sets: for any $\varepsilon > 0$, the probability of $(X_1, \ldots, X_n)$ belonging to $A_{\varepsilon}^{(n)}$ goes to one as $n \to \infty$. For these reasons, it is often a good heuristic to think of a draw of $(X_1, \ldots, X_n)$ as uniformly distributed on the associated typical set. For example, if $X_1, \ldots, X_n$ are iid draws from a Bernoulli distribution with $\mathbb{P}(X_k = 1) = p$, the set of sequences $(x_1, \ldots, x_n) \in \{0, 1\}^n$ such that $\sum_k x_k \approx n \, p$ has about $2^{n \, h(p)}$ elements, where $h(p) = p \log_2(1/p) + (1 - p) \log_2(1/(1-p))$ is the entropy of a $\mathrm{Bernoulli}(p)$ random variable.
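A quick numerical illustration of this last claim, with an arbitrary bias $p$ and length $n$: on a logarithmic scale, the number of binary sequences with about $n p$ ones is close to $2^{n \, h(p)}$, and $-\frac{1}{n} \log_2 \mathbb{P}(X_1, \ldots, X_n)$ concentrates around $h(p)$.

```python
import numpy as np
from math import comb, log2

def h(p):
    """Entropy (in bits) of a Bernoulli(p) random variable."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 1_000, 0.3                      # arbitrary length and bias
k = round(n * p)

# Size of the "typical" set, on a log2 scale, versus the 2^{n h(p)} estimate
print("log2 #{sequences with n*p ones}:", log2(comb(n, k)))
print("n * h(p)                       :", n * h(p))

# AEP: -1/n * log2 P(X_1, ..., X_n) is close to h(p) for a typical iid draw
rng = np.random.default_rng(2)
x = rng.random(n) < p
log2_prob = x.sum() * log2(p) + (n - x.sum()) * log2(1 - p)
print("-log2 P(sample) / n            :", -log2_prob / n)
```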
Mutual information
Consider a pair of random variables $(X, Y)$. Assuming that $X$ stores (on average) $H(X)$ bits of useful information, how much of this information can be extracted from $Y$? Let us call this quantity $I(X; Y)$ since we will see in a second that this quantity is symmetric. If $X$ is independent from $Y$, no useful information about $X$ is contained in $Y$ and $I(X; Y) = 0$. On the contrary, if $X = Y$, the knowledge of $Y$ already contains all the information about $X$ and $I(X; Y) = H(X)$. If one knows $Y$, one needs on average $H(X \mid Y)$ binary questions (i.e. $H(X \mid Y)$ bits of additional information) in order to determine $X$ with certainty and recover all the information contained in $X$. This means that the knowledge of $Y$ already contains $H(X) - H(X \mid Y)$ useful bits of information about $X$! This quantity is called the mutual information of the two random variables $X$ and $Y$, and it has the good taste of being symmetric:

$$I(X; Y) \;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X) \;=\; H(X) + H(Y) - H(X, Y).$$
Naturally, one can define a conditional version of it by setting $I(X; Y \mid Z) = \mathbb{E}_{z \sim Z}\!\left[ I(X_z; Y_z) \right]$, where $(X_z, Y_z)$ has the law of $(X, Y)$ conditioned on $\{Z = z\}$. Since $I(X; Y \mid Z)$ is the reduction in uncertainty of $X$ due to $Y$ when $Z$ is given, there are indeed situations when $I(X; Y \mid Z)$ is larger than $I(X; Y)$ – it is to be contrasted to the intuitive inequality $H(X \mid Z) \le H(X)$, which is indeed true. A standard such example is when $X$ and $Y$ are independent random variables and $Z = X + Y$: a short computation gives that $I(X; Y \mid Z) > 0$ while, indeed, $I(X; Y) = 0$. This definition of conditional mutual information leads to a chain-rule property,

$$I(X; Y_1, Y_2) \;=\; I(X; Y_1) + I(X; Y_2 \mid Y_1),$$
which can indeed be generalized to any number of variables. Furthermore, if the $Y_i$ are conditionally independent given $X$ (e.g. if $Y_1$ and $Y_2$ only depend on $X$), then the sub-additivity of the entropy readily gives that

$$I(X; Y_1, \ldots, Y_n) \;\le\; \sum_{i=1}^{n} I(X; Y_i).$$
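The example mentioned above can be checked with a few lines of code; the particular choice of two independent fair bits with $Z = X + Y$ is only for illustration.

```python
import numpy as np
from itertools import product

def entropy(q):
    """Shannon entropy (in bits) of an array of probabilities."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# X, Y: independent fair bits (an illustrative choice); Z = X + Y
p_xyz = {}
for x, y in product([0, 1], repeat=2):
    p_xyz[(x, y, x + y)] = 0.25

def marginal(p, keep):
    """Marginal distribution over the coordinates listed in `keep`."""
    out = {}
    for key, prob in p.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out

def mutual_info(p, a, b):
    """I(A; B) between the coordinate groups a and b of the joint distribution p."""
    return (entropy(list(marginal(p, a).values()))
            + entropy(list(marginal(p, b).values()))
            - entropy(list(marginal(p, a + b).values())))

# I(X; Y) = 0 since X and Y are independent
print("I(X; Y)    :", mutual_info(p_xyz, (0,), (1,)))
# I(X; Y | Z) = I(X; Y, Z) - I(X; Z) by the chain rule, and it is strictly positive here
print("I(X; Y | Z):", mutual_info(p_xyz, (0,), (1, 2)) - mutual_info(p_xyz, (0,), (2,)))
```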
Importantly, algebra shows that $I(X; Y)$ can also be expressed as the Kullback-Leibler divergence between the joint distribution $p_{X,Y}$ and the product of the marginals $p_X \otimes p_Y$,

$$I(X; Y) \;=\; \mathrm{KL}\!\left( p_{X,Y} \,\middle\|\, p_X \otimes p_Y \right) \;=\; \sum_{x, y} p_{X,Y}(x, y) \, \log_2\!\left( \frac{p_{X,Y}(x, y)}{p_X(x) \, p_Y(y)} \right).$$
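This identity is also easy to verify numerically on an arbitrary random joint distribution, for instance as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
p_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)   # arbitrary joint distribution of (X, Y)
p_x = p_xy.sum(axis=1, keepdims=True)             # marginal of X, as a column
p_y = p_xy.sum(axis=0, keepdims=True)             # marginal of Y, as a row

def entropy(q):
    """Shannon entropy (in bits); the Dirichlet entries are strictly positive a.s."""
    q = np.asarray(q, dtype=float).ravel()
    return float(-(q * np.log2(q)).sum())

# I(X; Y) computed as H(X) + H(Y) - H(X, Y), as in the definition above ...
I_from_entropies = entropy(p_x) + entropy(p_y) - entropy(p_xy)
# ... and as the KL divergence between p_{X,Y} and the product of the marginals
I_from_kl = float((p_xy * np.log2(p_xy / (p_x * p_y))).sum())

print(I_from_entropies, I_from_kl)   # the two values agree
```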
This diagram from (MacKay 2003) nicely illustrates the different fundamental quantities $H(X)$, $H(Y)$, $H(X, Y)$, $H(X \mid Y)$, $H(Y \mid X)$ and $I(X; Y)$:
Naturally, if one considers three random variables $X \to Y \to Z$ forming a “Markov chain” (i.e. $Z$ depends on $X$ only through $Y$), we have the so-called data-processing inequality,

$$I(X; Z) \;\le\; I(X; Y) \qquad \text{and} \qquad I(X; Z) \;\le\; I(Y; Z).$$
The first inequality is clear since all the useful information about $X$ contained in $Z$ must be coming from $Y$, and $Y$ only contains $I(X; Y)$ bits about $X$. For the second inequality, note that if $Y$ contains $I(X; Y)$ bits about $X$, and $Z$ contains $I(Y; Z)$ bits about $Y$, then $Z$ cannot contain more than $\min\{I(X; Y), I(Y; Z)\}$ bits about $X$:

$$I(X; Z) \;\le\; \min\big\{ I(X; Y), \; I(Y; Z) \big\}.$$
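As a last sanity check, here is a small sketch where the chain $X \to Y \to Z$ is built from two arbitrary random channels; both inequalities hold, as expected.

```python
import numpy as np

def entropy(q):
    """Shannon entropy (in bits) of an array of probabilities."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

def mutual_info(p_ab):
    """I(A; B) computed from the joint distribution p_ab[a, b]."""
    return entropy(p_ab.sum(axis=1)) + entropy(p_ab.sum(axis=0)) - entropy(p_ab)

rng = np.random.default_rng(4)
Nx, Ny, Nz = 4, 4, 4                               # arbitrary alphabet sizes
p_x = rng.dirichlet(np.ones(Nx))                   # distribution of X
K1 = rng.dirichlet(np.ones(Ny), size=Nx)           # channel X -> Y: K1[x, y] = P(Y=y | X=x)
K2 = rng.dirichlet(np.ones(Nz), size=Ny)           # channel Y -> Z: K2[y, z] = P(Z=z | Y=y)

p_xy = p_x[:, None] * K1                           # joint distribution of (X, Y)
p_yz = p_xy.sum(axis=0)[:, None] * K2              # joint distribution of (Y, Z)
p_xz = p_xy @ K2                                   # joint of (X, Z): sum_y P(x, y) P(z | y)

print("I(X; Z):", mutual_info(p_xz))
print("I(X; Y):", mutual_info(p_xy))               # >= I(X; Z)
print("I(Y; Z):", mutual_info(p_yz))               # >= I(X; Z)
```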
References
MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.