Information Theory: Entropy and Basic Definitions

infoTheory

Published: 23 09 2023
Modified: 27 09 2023

Claude Shannon (1916 – 2001)

Shannon Entropy & Compression

If Alice chooses a number $X$ uniformly at random from the set $\{1,2,\dots,N\}$, Bob can use a simple “dichotomy” strategy, asking Alice $\log_2(N)$ binary Yes/No questions, to correctly identify the number. This result can be generalized to non-uniform distributions (cf. Huffman codes, and also the Kraft-McMillan inequality). If Alice chooses a number $X$ from $\{1,2,\dots,N\}$ with probabilities $P(X=k)=p_k$, Bob can design a deterministic strategy to find the answer using, on average, about

$$
H(X) \; = \; -\sum_{k=1}^{N} p_k \log_2(p_k) \tag{1}
$$

binary questions, i.e. bits. To be more precise, there are strategies that require at most $H(X)+1$ questions on average, and none that can require fewer than $H(X)$. Note that applying this remark to an iid sequence $X_{1:T}=(X_1,\dots,X_T)$ and using the fact that $H(X_1,\dots,X_T)=T\,H(X)$, this shows that one can exactly determine the sequence $X_{1:T}$ with at most $T\,H(X)+1$ binary questions on average. The quantity $H(X)$ defined in Equation 1 is known as the Shannon Entropy of the distribution $(p_1,\dots,p_N)$. These remarks also imply that there are strategies that can encode each integer $1 \leq x \leq N$ as a binary string of length $L(x)$ (i.e. with $L(x)$ bits), with the expected length $\mathbb{E}[L(X)]$ approximately equal to $H(X)$. This is because a sequence of binary questions can be thought of as a binary tree, etc…
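
To make this coding picture concrete, here is a minimal Python sketch (the `huffman_lengths` helper and the example distribution are purely illustrative): it builds a Huffman code for a given distribution and checks that the expected code length $\mathbb{E}[L(X)]$ indeed lies between $H(X)$ and $H(X)+1$.

```python
import heapq
import math

def huffman_lengths(p):
    """Code lengths of a Huffman code for the distribution p (a list of probabilities)."""
    # Heap entries: (probability of the subtree, unique tie-breaker, symbols in the subtree)
    heap = [(pk, k, [k]) for k, pk in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, k2, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every symbol in the merged subtree gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, k2, s1 + s2))
    return lengths

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
H = -sum(pk * math.log2(pk) for pk in p)
EL = sum(pk * l for pk, l in zip(p, huffman_lengths(p)))
print(f"H(X) = {H:.3f} bits, E[L(X)] = {EL:.3f} bits")   # H(X) <= E[L(X)] < H(X) + 1
```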

This remark can be used for compression. Imagine a very long sequence $(X_1,\dots,X_T)$ of iid samples from $X$. Encoding each $X_i$ with $L(X_i)$ bits, one should be able to encode the resulting sequence with

$$
L(X_1) + \dots + L(X_T) \; \approx \; T \, \mathbb{E}[L(X)] \; \approx \; T \, H(X)
$$

bits. Can the usual zip compression algorithm do this? To test this, choose a probability distribution on $\{1,\dots,N\}$, generate an iid sequence of length $T \gg 1$, compress this using the gzip() command, and finally look at the size of the resulting file (in bits). I have done that a few times, with a few different values of $16 \leq N \leq 256$ and a few random distributions on $\{1,\dots,N\}$, and with $T=10^6$. The plot of the size of the compressed files versus the Shannon entropy $H$ looks as below:

Seems like the zip algorithm works almost optimally for compressing iid sequences.
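
For reference, a minimal sketch of this experiment in Python, using numpy and the standard gzip module as a stand-in for the gzip() command above (the values of $N$ and $T$ are illustrative):

```python
import gzip
import numpy as np

rng = np.random.default_rng(0)
N, T = 64, 10**6

# Random probability distribution on {1, ..., N}
p = rng.dirichlet(np.ones(N))

# Shannon entropy of the distribution, in bits
H = -np.sum(p * np.log2(p))

# iid sequence of length T, one symbol per byte (works as long as N <= 256)
x = rng.choice(N, size=T, p=p).astype(np.uint8)

# gzip-compress the raw bytes and measure the size in bits
compressed_bits = 8 * len(gzip.compress(x.tobytes(), compresslevel=9))

print(f"T * H(X)       : {T * H:,.0f} bits")
print(f"gzip-compressed: {compressed_bits:,} bits")
```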

Sequence of random variables

Now consider a pair of discrete random variables $(X,Y)$. If Alice draws samples from this pair of rvs, one can ask $H(X,Y)$ binary questions on average to exactly find out both values. To do that, one can ask $H(X)$ questions to determine $X$, and once $X=x$ is determined, one can then ask about $H(Y|X=x) = -\sum_y P(Y=y|X=x)\log_2 P(Y=y|X=x)$ further questions to determine $Y$. This strategy requires on average $H(X) + \sum_x P(X=x)\,H(Y|X=x)$ binary questions and is actually optimal, showing that

$$
H(X,Y) = H(X) + H(Y|X) \tag{2}
$$

where we have defined $H(Y|X) = \sum_x P(X=x)\,H(Y|X=x)$.
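
Equation 2 is easy to check numerically; here is a small Python sketch on a randomly generated joint distribution (the sizes and the Dirichlet sampling are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy (in bits) of a probability array, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Random joint distribution of (X, Y) on {0,...,3} x {0,...,4}
joint = rng.dirichlet(np.ones(4 * 5)).reshape(4, 5)
p_x = joint.sum(axis=1)

# H(Y|X) = sum_x P(X = x) * H(Y | X = x)
H_Y_given_X = sum(p_x[x] * entropy(joint[x] / p_x[x]) for x in range(4))

print(entropy(joint.ravel()))        # H(X, Y)
print(entropy(p_x) + H_Y_given_X)    # H(X) + H(Y|X): same value
```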

Indeed, one can generalize these concepts to more than two random variables. Iterating Equation 2 shows that the trajectory $X_{1:T} \equiv (X_1,\dots,X_T)$ of a stationary ergodic Markov chain can be determined on average with $H(X_{1:T})$ binary questions, where

$$
H(X_{1:T}) = H(X_1) + H(X_2|X_1) + \dots + H(X_T|X_{T-1}) \; \approx \; T \, H(X_{k+1}|X_k) \; = \; -T \sum_{x} \pi(x) \sum_{y} p(x \to y) \log_2\!\big[ p(x \to y) \big].
$$

Here, $\pi(dx)$ is the equilibrium distribution of the Markov chain and $p(x \to y)$ are the transition probabilities.

Can gzip() compress Markovian trajectories and roughly achieve that level of compression? Indeed, one can test this by generating random Markov transition matrices (which are, with very high probability, ergodic with respect to an equilibrium distribution $\pi$). Doing this with trajectories of length $10^4$ (i.e. quite short, because generating longer ones is quite slow) on $\{1,\dots,N\}$ with $2 \leq N \leq 64$, one gets the following results:

In red is the entropy estimated without using the Markovian structure, i.e. assuming that the $X_i$ are iid samples. One can clearly see that the dependencies between successive samples are exploited for better compression. Nevertheless, the optimal compression rate given by the Shannon entropy is not reached. Indeed, gzip() is not an optimal algorithm: it cannot even properly compress the sequence $(1,2,3,1,2,3,1,2,3,\dots)$!
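
For reference, a sketch of this Markov experiment in Python (again with numpy and the standard gzip module; drawing the rows of the transition matrix from a Dirichlet distribution is just one convenient way of generating random, almost surely ergodic, chains):

```python
import gzip
import numpy as np

rng = np.random.default_rng(2)
N, T = 16, 10**4

# Random transition matrix: each row is drawn from a Dirichlet distribution
P = rng.dirichlet(np.ones(N), size=N)

# Equilibrium distribution pi: left eigenvector of P associated with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

# Entropy rate of the chain, in bits per step
H_rate = -np.sum(pi[:, None] * P * np.log2(P))

# Simulate a trajectory of length T started at equilibrium
x = np.empty(T, dtype=np.uint8)
x[0] = rng.choice(N, p=pi)
for t in range(1, T):
    x[t] = rng.choice(N, p=P[x[t - 1]])

compressed_bits = 8 * len(gzip.compress(x.tobytes(), compresslevel=9))
print(f"T * entropy rate: {T * H_rate:,.0f} bits")
print(f"gzip-compressed : {compressed_bits:,} bits")
```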

Asymptotic Equipartition Property (AEP)

The AEP is a simple remark that gives a convenient way of reasoning about long sequences of random variables $X_{1:T}=(X_1,\dots,X_T)$ with $T \gg 1$. For example, assuming that the random variables $X_i$ are independent and identically distributed as the random variable $X$, the law of large numbers (LLN) gives that

$$
-\frac{1}{T} \log_2 p(X_{1:T}) \; = \; -\frac{1}{T} \sum_{i=1}^{T} \log_2 p(X_i) \; \longrightarrow \; H(X).
$$

This means that any “typical” sequence has a probability of about $2^{-T\,H(X)}$ of occurring, which also means that there are about $2^{T\,H(X)}$ such “typical” sequences. Indeed, one could use the LLN for Markov chains and obtain a similar statement for Markovian sequences. All this can be made rigorous with large deviation theory, for example. This also establishes the link with the “statistical physics” definition of entropy as the logarithm of the number of configurations… Sets of the form

$$
A_{\varepsilon} \; = \; \left\{ x_{1:T} \; : \; \left| -\frac{1}{T} \log_2 p(x_{1:T}) - H(X) \right| < \varepsilon \right\}
$$

are usually called typical sets: for any $\varepsilon>0$, the probability that $X_{1:T}$ belongs to $A_{\varepsilon}$ goes to one as $T \to \infty$. For these reasons, it is often a good heuristic to think of a draw of $(X_1,\dots,X_T)$ as uniformly distributed on the associated typical set. For example, if $(X_1,\dots,X_N)$ are $N$ iid draws from a Bernoulli distribution with $P(X=1)=1-P(X=0)=p$, the set $A \subset \{0,1\}^N$ of sequences such that $x_1+\dots+x_N = Np$ has $\binom{N}{Np} \approx 2^{N\,h_2(p)}$ elements, where $h_2(q) = -\big[q\log_2(q)+(1-q)\log_2(1-q)\big]$ is the entropy of a $\mathrm{Bern}(q)$ random variable.
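
A small numerical illustration of the AEP for iid Bernoulli draws, in Python (the values of $T$ and $p$ are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
T, p = 10**6, 0.3

def h2(q):
    """Entropy (in bits) of a Bernoulli(q) random variable."""
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

# T iid Bernoulli(p) draws
x = rng.random(T) < p
ones = int(x.sum())

# -(1/T) * log2 p(X_{1:T}) for this iid sequence
log2_prob = ones * math.log2(p) + (T - ones) * math.log2(1 - p)
print(f"-(1/T) log2 p(X_1:T) = {-log2_prob / T:.4f}")
print(f"h2(p)                = {h2(p):.4f}")
```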

Mutual information

Consider a pair of random variables $(X,Y)$. Assuming that $X$ stores (on average) $H(X)$ bits of useful information, how much of this information can be extracted from $Y$? Let us call this quantity $I(X;Y)$ since we will see in a second that it is symmetric. If $Y$ is independent of $X$, no useful information about $X$ is contained in $Y$ and $I(X;Y)=0$. On the contrary, if $X=Y$, the knowledge of $Y$ already contains all the information about $X$ and $I(X;Y)=H(X)=H(Y)$. If one knows $Y$, one needs on average $H(X|Y)$ binary questions (i.e. bits of additional information) in order to determine $X$ with certainty and recover all the information contained in $X$. This means that the knowledge of $Y$ already contains $I(X;Y) = H(X) - H(X|Y)$ useful bits of information about $X$! This quantity is called the mutual information of the two random variables $X$ and $Y$, and it has the good taste of being symmetric:

$$
I(X;Y) \; = \; H(X) - H(X|Y) \; = \; H(X) + H(Y) - H(X,Y).
$$

Naturally, one can define a conditional version of it by setting $I(X;Y|Z) = \sum_z P(Z=z)\,I(X_z;Y_z)$ where $(X_z,Y_z)$ has the law of $(X,Y)$ conditioned on $Z=z$. Since $I(X;Y|Z)$ is the reduction in uncertainty of $X$ due to $Y$ when $Z$ is given, there are indeed situations when $I(X;Y|Z)$ is larger than $I(X;Y)$; this is to be contrasted with the intuitive inequality $H(X|Z) \leq H(X)$, which is indeed true. A standard such example is when $X$ and $Y$ are independent $\mathrm{Bern}(1/2)$ random variables and $Z=X+Y$: a short computation gives that $I(X;Y|Z)=1/2$ while, indeed, $I(X;Y)=0$. This definition of conditional mutual information leads to a chain-rule property,

$$
I(X;(Y_1,Y_2)) \; = \; I(X;Y_1) + I(X;Y_2 \,|\, Y_1),
$$

which can indeed be generalized to any number of variables. Furthermore, if the $Y_i$ are conditionally independent given $X$ (e.g. if $X=(X_1,\dots,X_T)$ and each $Y_i$ only depends on $X_i$), then the sub-additivity of the entropy readily gives that

$$
I(X;(Y_1,\dots,Y_N)) \; \leq \; \sum_{i=1}^{N} I(X;Y_i).
$$

Importantly, a bit of algebra shows that $I(X;Y)$ can also be expressed as the Kullback-Leibler divergence between the joint distribution $P_{X,Y}$ and the product of the marginals $P_X \otimes P_Y$,

$$
I(X;Y) \; = \; D_{\mathrm{KL}}\big( P_{X,Y} \,\big\|\, P_X \otimes P_Y \big).
$$
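
These identities, together with the conditional example above, are easy to verify numerically; here is a small Python sketch that computes $I(X;Y)$ both from entropies and as a KL divergence, and checks that $I(X;Y|Z)=1/2$ when $X,Y$ are independent $\mathrm{Bern}(1/2)$ and $Z=X+Y$:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits), ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(X;Y) from a 2D joint probability table, via H(X) + H(Y) - H(X,Y)."""
    p_a, p_b = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(joint.ravel())

# X, Y independent Bern(1/2): joint table p_xy[x, y]
p_xy = np.full((2, 2), 0.25)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# I(X;Y) as a KL divergence between the joint and the product of the marginals
I_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))
print(mutual_information(p_xy), I_kl)      # both equal 0 here

# Conditional mutual information I(X;Y|Z) with Z = X + Y
I_cond = 0.0
for z in (0, 1, 2):
    mask = np.array([[xx + yy == z for yy in (0, 1)] for xx in (0, 1)])
    p_z = p_xy[mask].sum()
    joint_z = np.where(mask, p_xy, 0.0) / p_z    # law of (X, Y) given Z = z
    I_cond += p_z * mutual_information(joint_z)

print(I_cond)                              # 0.5, while I(X;Y) = 0
```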

This diagram from (MacKay 2003) nicely illustrates the different fundamental quantities $H(X)$, $H(X,Y)$, $H(Y|X)$ and $I(X;Y)$:

Naturally, if one considers three random variables $X \to Y \to Z$ forming a “Markov chain”, we have the so-called data-processing inequality,

$$
I(X;Z) \leq I(X;Y) \qquad \text{and} \qquad I(X;Z) \leq I(Y;Z).
$$

The first inequality is clear since all the useful information about $X$ contained in $Z$ must come through $Y$, and $Y$ only contains $I(X;Y)$ bits about $X$. For the second inequality, note that if $Z$ contains $I(Y;Z)$ bits about $Y$, and $Y$ contains $I(X;Y)$ bits about $X$, then $Z$ cannot contain more than $I(Y;Z)$ bits about $X$:

Data Processing Inequality for the Markov chain $X \to Y \to Z$
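
A quick numerical sanity check of the data-processing inequality, in Python, on a simple binary chain where $Y$ is $X$ flipped with probability $a$ and $Z$ is $Y$ flipped with probability $b$ (the flip probabilities are illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I between the two coordinates of a 2D joint probability table."""
    p_a, p_b = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(joint.ravel())

a, b = 0.1, 0.3                             # illustrative flip probabilities

# X ~ Bern(1/2); Y is X flipped with prob. a; Z is Y flipped with prob. b
p_x = np.array([0.5, 0.5])
K_a = np.array([[1 - a, a], [a, 1 - a]])    # P(Y | X)
K_b = np.array([[1 - b, b], [b, 1 - b]])    # P(Z | Y)

joint_xy = p_x[:, None] * K_a               # P(X, Y)
joint_xz = joint_xy @ K_b                   # P(X, Z), summing over Y
joint_yz = joint_xy.sum(axis=0)[:, None] * K_b   # P(Y, Z)

print(mutual_information(joint_xz) <= mutual_information(joint_xy))  # True
print(mutual_information(joint_xz) <= mutual_information(joint_yz))  # True
```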

References

MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.