David MacKay

Information Theory, Inference, and Learning Algorithms



Extra exercises

Information theory

  1. Solve the weighing problem for N coins of which one is odd and known to be lighter. You may find it interesting to minimize the average number of weighings required (rather than the maximum number) for the cases N=6 and N=7. [D. G. Mead, The average number of weighings to locate a counterfeit coin, IEEE Trans IT Sept 1979, vol 25, no 5, 616-617]
  2. Efficient HIV tests. $N$ individuals give blood samples for HIV testing. The prior probability that individual n is HIV positive is p_n. For simplicity assume all p_n are 1/100 and consider N=700 as an example. HIV tests are expensive. Instead of testing each of the N samples separately, can you save tests by mixing subsamples together and testing them simultaneously? Assume that the result of a merged test is the simple 'or' of the binary variables x_n that state whether the individuals are HIV positive.

  3. The task of sorting N objects by comparing them two at a time is one of the cornerstones of computer science algorithms. Finding the correct order from the N! possible orders requires log2(N!) bits of information. So any method based on comparing two objects must require log2(N!) or more comparisons. There are many sorting algorithms whose expected complexity (assuming randomly ordered input) is of order N log N (for example, Heapsort and Quicksort are famous). But in practice, when sorting say 3000 objects, Heapsort uses more than twice as many comparisons as Quicksort! And Quicksort is far from perfect: on average it uses 1.39 times as many comparisons as the ideal number [log2(N!)]. Use information theory to criticise Heapsort and Quicksort. Design a sorting algorithm that is better in the sense that it uses fewer comparisons, on average.
    Solution by David MacKay
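As a warm-up, one can instrument a textbook random-pivot Quicksort (an assumption here: the simple list-partitioning variant, not a tuned in-place version) and measure its comparison count against the information-theoretic ideal log2(N!):

```python
import math
import random

def quicksort(xs, counter):
    # textbook random-pivot Quicksort; counter[0] tallies binary comparisons
    if len(xs) <= 1:
        return xs
    i = random.randrange(len(xs))
    pivot, rest = xs[i], xs[:i] + xs[i + 1:]
    left, right = [], []
    for v in rest:
        counter[0] += 1
        (left if v < pivot else right).append(v)
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

random.seed(0)
N = 3000
data = random.sample(range(N), N)
counter = [0]
result = quicksort(data, counter)
ideal = math.lgamma(N + 1) / math.log(2)   # log2(N!), about 30000 bits here
ratio = counter[0] / ideal                 # roughly 1.3 for this simple variant
```

On average this variant makes about 2N ln N comparisons, visibly above the log2(N!) lower bound, so any improvement from a cleverer algorithm shows up directly in the measured ratio.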

  4. How many binary comparisons are required to sort 5 objects?
    [Hint: 5! = 120 < 128]
    Can you find a method that guarantees to sort them in 7 comparisons?
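For reference, here is one sketch of such a method (a hand-rolled merge insertion in the spirit of the Ford-Johnson algorithm) that never needs more than 7 comparisons; since log2(120) is just under 7, seven is optimal:

```python
from itertools import permutations

def sort5(v):
    # sort 5 distinct items with at most 7 binary comparisons
    cnt = [0]
    def less(x, y):
        cnt[0] += 1
        return x < y
    a, b, c, d, e = v
    if less(b, a): a, b = b, a               # ensure a < b
    if less(d, c): c, d = d, c               # ensure c < d
    if less(d, b): a, b, c, d = c, d, a, b   # ensure b < d; chain a < b < d, and c < d
    # binary-insert e into the chain a < b < d (always 2 comparisons)
    if less(e, b):
        chain = ([e, a, b, d] if less(e, a) else [a, e, b, d])
    else:
        chain = ([a, b, e, d] if less(e, d) else [a, b, d, e])
    # binary-insert c among the elements known to lie below d (at most 3 of them)
    pos = chain.index(d)
    sub = chain[:pos]
    if pos == 2:
        if less(c, sub[0]): sub = [c] + sub
        elif less(c, sub[1]): sub = [sub[0], c, sub[1]]
        else: sub = sub + [c]
    else:  # pos == 3
        if less(c, sub[1]):
            sub = ([c] + sub) if less(c, sub[0]) else [sub[0], c, sub[1], sub[2]]
        else:
            sub = [sub[0], sub[1], c, sub[2]] if less(c, sub[2]) else sub + [c]
    return sub + chain[pos:], cnt[0]

worst, all_sorted = 0, True
for p in permutations(range(5)):
    out, ncmp = sort5(list(p))
    all_sorted = all_sorted and (out == [0, 1, 2, 3, 4])
    worst = max(worst, ncmp)
```

Checking all 120 permutations confirms that the worst case is exactly 7 comparisons.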

  5. Mutual information. Consider a channel with input $x$ and output $y$. Let |Y| = some integer greater than 5. Let |X| = |Y| choose 2.

    Let the marginal distribution of X be uniform. Let the transition matrix from X to Y be a weight-two matrix like this: (with every 1 entry replaced by 0.5)

    1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0
    0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0
    0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0
    0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
    0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 1
    [rows are labelled by y, cols by x.]

    Now, assume that one input $x$ is selected randomly and transmitted twice over the channel. Let $y_1$ and $y_2$ be the two outputs. Find the mutual informations $I(X;Y_1)$ and $I(Y_1;Y_2)$.


     I(X;Y1) = H(Y1) - H(Y1|X) = log|Y| - 1
     I(Y1;Y2) = log|Y| - ( 1 + 0.5 log(|Y|-1) )
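These answers can be checked numerically; the sketch below (taking |Y| = 7, so |X| = 21, as in the matrix shown) builds the joint distributions directly:

```python
from itertools import combinations
from math import log2

Y = 7
pairs = list(combinations(range(Y), 2))   # the |X| = 21 inputs, one per pair {i, j}
pX = 1.0 / len(pairs)

def H(dist):
    # entropy in bits of a list of probabilities
    return -sum(p * log2(p) for p in dist if p > 0)

# marginal of one output: given input {i, j}, the output is i or j with prob 0.5
pY = [sum(pX * 0.5 for x in pairs if y in x) for y in range(Y)]
I_XY1 = H(pY) - 1.0        # H(Y1|X) is exactly 1 bit for every input

# two independent uses of the channel with the same input x
pYY = [[sum(pX * 0.25 for x in pairs if y1 in x and y2 in x)
        for y2 in range(Y)] for y1 in range(Y)]
I_Y1Y2 = sum(pYY[a][b] * log2(pYY[a][b] / (pY[a] * pY[b]))
             for a in range(Y) for b in range(Y) if pYY[a][b] > 0)
```

Both computed values agree with the closed-form answers above.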
  6. A correlation puzzle [Hard]
  7. Investigate the two codes used on CDs, and quantify how far each one is from the capacity of an appropriate channel model. [CDs use an error-correcting code and a run-length limiting code. The ECC is composed of two cross-interleaved Reed-Solomon codes; typical parameters in bytes are (N,K)=(28,24) and (32,28). The inner run-length-limiting code is known as "EFM", which stands for eight-to-fourteen modulation; this code converts every eight bits into a string of seventeen bits (yes, seventeen, not fourteen!) in such a way that the smallest run length is 3 bits, the average run length is 7 bits, and the maximum run length is 11 bits.]
  8. Entropy inequalities The order-$r$ {R\'enyi} entropy is
    H^{(r)}(p) = (1/(1-r)) log ( Sum_i p_i^r ) , for r ≠ 1.
    Using Jensen's inequality, prove that, for any $r$, this function of $p$ bounds the Shannon entropy of $p$ (a lower bound for r > 1, an upper bound for r < 1). [Hint: I found the case $r=2$ easiest to start with.]
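A quick numerical sanity check of the direction of the inequality (the Rényi entropy is nonincreasing in r, so it lies below the Shannon entropy for r > 1 and above it for r < 1):

```python
import random
from math import log2

def shannon(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def renyi(p, r):
    # order-r Renyi entropy, valid for r != 1
    return log2(sum(pi ** r for pi in p)) / (1 - r)

random.seed(1)
dists = []
for _ in range(200):
    w = [random.random() for _ in range(6)]
    dists.append([wi / sum(w) for wi in w])

below_for_r2 = all(renyi(p, 2) <= shannon(p) + 1e-9 for p in dists)
above_for_r_half = all(renyi(p, 0.5) >= shannon(p) - 1e-9 for p in dists)
```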
  9. Divergences and distances
    Show, by constructing example probability distributions a, b, c, that the relative entropy D_KL does not satisfy the triangle inequality that 'distances' satisfy. (D(a,c) <= D(a,b) + D(b,c).)
    What about the symmetrized relative entropy, D_SRE(a,b) = 0.5(D_KL(a,b) + D_KL(b,a))?
    Prove that D_SRE(a,b) is a distance, or prove, by constructing example probability distributions a, b, c, that the D_SRE does not satisfy the triangle inequality that 'distances' satisfy.
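A concrete counterexample to start from (three nearby Bernoulli distributions; D_KL behaves locally like a *squared* distance, so a point b midway between a and c makes both legs disproportionately cheap):

```python
from math import log

def KL(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def SRE(p, q):
    # symmetrized relative entropy
    return 0.5 * (KL(p, q) + KL(q, p))

a, b, c = (0.5, 0.5), (0.6, 0.4), (0.7, 0.3)
kl_violation = KL(a, c) > KL(a, b) + KL(b, c)      # triangle inequality fails
sre_violation = SRE(a, c) > SRE(a, b) + SRE(b, c)  # ... and for D_SRE too
```

The same triple defeats both the raw and the symmetrized divergence.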
  10. Compression One simple way to encode a binary file of length N=10,000 in which a fraction f=0.01 of the bits are 1s is to send the 100 or so positions of the 1s using 14-bit integers. This code has redundancy because there are many ways (roughly 100! = 100x99x98x...x1) of encoding one file, since the positions can be conveyed in any order. In the bits back coding method, this redundancy is exploited by piggy-backing other data onto the transmission. In this case, another source of bits can be used to choose from the 100! possible orders of the bits. The receiver can then recover the extra bits once the message has been received. Implement this bits-back method. Can you find a way to implement bits-back coding such that the `extra bits' are actually the original coded positions? (A bit like a snake swallowing itself!)
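Some arithmetic (a sketch using lgamma for the factorials) quantifies how much bits-back buys here:

```python
from math import lgamma, log, log2

N, f, K = 10_000, 0.01, 100            # file length, density, number of 1s
naive = K * 14                         # K positions sent as 14-bit integers
redundancy = lgamma(K + 1) / log(2)    # log2(K!) bits: the order is arbitrary
ideal = (lgamma(N + 1) - lgamma(K + 1) - lgamma(N - K + 1)) / log(2)  # log2 C(N,K)
entropy = N * (-f * log2(f) - (1 - f) * log2(1 - f))                  # N H_2(f)
```

The naive code costs 1400 bits; bits-back recovers log2(100!), about 525 of them, leaving about 875 bits, close to the ideal log2 C(10000,100) of roughly 803 bits (the residual gap comes from the slack in the 14-bit integers).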
  11. Coding for a redundant language.
     [ABC] ->  2
     [DEF] ->  3
     [GHI] ->  4
     [JKL] ->  5
     [MNO] ->  6
     [PQRS] -> 7
     [TUV] ->  8
     [WXYZ] -> 9
    The standard mobile phone encoder
    When sending text on mobile phones, some people use the iTap/T9/predictive text system, in which, for example, HELLO is written as 43556, and, since no other word in the dictionary is encoded as 43556, the phone deduces that you wanted HELLO. However, some words collide under this T9 encoding, for example KISS and LIPS are both 5477.

    Investigate how much better the T9 system would work if a different encoding of the alphabet A-Z onto the numbers 2..9 were used.
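A sketch of the T9 encoder to build on; a serious investigation would score candidate letter-to-digit maps against a large dictionary (not included here), but the collision counter shows the idea:

```python
keypad = {2: "ABC", 3: "DEF", 4: "GHI", 5: "JKL",
          6: "MNO", 7: "PQRS", 8: "TUV", 9: "WXYZ"}
digit = {ch: str(d) for d, letters in keypad.items() for ch in letters}

def t9(word):
    return "".join(digit[ch] for ch in word.upper())

def count_collisions(words, enc):
    # words that land on a code some earlier word already claimed
    seen, clashes = set(), 0
    for w in words:
        code = enc(w)
        clashes += code in seen
        seen.add(code)
    return clashes

demo = count_collisions(["HELLO", "KISS", "LIPS"], t9)  # LIPS collides with KISS
```

Swapping in an alternative `digit` map and re-running `count_collisions` over a dictionary is then a one-line change.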

  12. Capacity of a simple deletion channel. A channel has 8 input bits and the output is always exactly 7 bits. One of the 8 bits is deleted, at random. The other 7 are transmitted without error. (Note the difference from an erasure channel, which replaces the erased bits by "?".) Estimate the capacity of this channel. Is the optimal input distribution uniform? Find the highest rate code that you can, that communicates reliably over this channel.
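One classical way to get a concrete high-rate zero-error code for this channel is a single-deletion-correcting code; the sketch below uses the Varshamov-Tenengolts construction VT_0(8) and verifies that the sets of possible 7-bit outputs of distinct codewords never overlap:

```python
from itertools import product

n = 8
# Varshamov-Tenengolts code VT_0(8): words with sum(i * x_i) = 0 (mod n+1)
code = ["".join(map(str, w)) for w in product((0, 1), repeat=n)
        if sum(i * x for i, x in enumerate(w, start=1)) % (n + 1) == 0]

def deletions(word):
    # the set of 7-bit strings the channel can output for this input
    return {word[:i] + word[i + 1:] for i in range(len(word))}

balls = [deletions(w) for w in code]
zero_error = all(balls[i].isdisjoint(balls[j])
                 for i in range(len(balls)) for j in range(i + 1, len(balls)))
```

The code has 30 codewords, so log2(30), about 4.9 bits, can be sent with zero error in each 8-bit block; this lower-bounds the capacity but need not achieve it.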
  13. Distance properties of codes, related to erasure channels. Imagine a block code with symbols drawn from an alphabet A has the parameters (N, K, d), meaning block length N, number of codewords (2**K), and minimum distance d. Here, 'distance' means 'the number of symbols in which two words differ from each other'. The erasure channel replaces entire symbols by an erasure symbol. Prove the following theorem: 1. An (N,K,d) code can correct any (d-1) erasures in a block. 2. But there exist some erasures of d symbols that the code cannot correct.
  14. Solution: 1. Imagine that E <= d-1 erasures have occurred. The receiver knows that the true codeword "t" is in the set of (|A|**E) candidate strings that can be created by filling in the erasures in all possible ways. Could erasure correction fail? Only if that set contains (at least) one other codeword. We proceed by contradiction. Assume that there is such a rival codeword, u. Then u and t are identical in the (N-E) unerased positions, so they differ in at most E <= d-1 positions, contradicting the minimum distance d of the code. Hence there is no rival codeword u. 2. Take two codewords t, u that differ in d places. Transmit t, and erase those d symbols. The receiver cannot know which of t or u was transmitted. QED

Advanced Information Theory

  1. Channel with frozen bits [5].
    Imagine a memory device having the following property: of the N bits, a fraction $s$ are frozen to particular values, randomly chosen from {0,1}. It's not known in advance which bits will be stuck. The encoder can find out which bits are stuck at encoding time. The decoder can only read out the values of the N bits. To make life more difficult, we might assume that the read-out process is noisy, with a flip probability $f$ for all bits. What is the capacity of this channel? How should the encoder set the non-frozen bits in order to achieve reliable communication of $K$ source bits? Can you make a practical encoding and decoding algorithm?


Inference

  1. Cumulative distributions (inference of) You've got a sample $\{ x_n \}_{n=1}^N$ of points assumed to come from a distribution, and you want to show what you believe the cumulative distribution looks like, given these points. We want not only a guess but also error bars. Perhaps you want to compare two cumulative distributions with each other and quantify whether they seem to be `significantly different from each other'. How do you display a guess for the cumulative distribution function, with error bars?
    Solution (one-page postscript / pdf)
  2. Message passing
    Imagine you want to help out at a dictation school, where students take dictation and the teacher then has to count how many errors each student made in what they wrote.
    Write an inference algorithm that aligns one piece of text with another, assuming that the second one differs from the first by random insertions, deletions and substitutions. Assume a small probability of insertion per character, p_i, a similar probability of deletion, p_d, and a substitution probability p_s. Find both the most probable single alignment (using the min-sum algorithm, aka Viterbi), and the probability distribution of the alignment at any particular character in the noisy sequence (using the sum-product algorithm). Can you make a program that, as the second user writes his text, computes on the fly where the second user has probably got up to? Remember, the aim is for efficient computation, so try to make a method whose cost is at most proportional to N per iteration, where N is the length of the first piece of text.
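A minimal min-sum (Viterbi) sketch of the alignment part, scoring insertions, deletions and substitutions by their negative log probabilities (the sum-product pass and the O(N)-per-iteration on-line version are left as the exercise intends):

```python
from math import log

def align(source, noisy, p_i=0.01, p_d=0.01, p_s=0.01):
    # min-sum alignment: path costs are negative log probabilities
    c_match = -log(1.0 - p_i - p_d - p_s)
    c_sub, c_ins, c_del = -log(p_s), -log(p_i), -log(p_d)
    n, m = len(source), len(noisy)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = cost[i][j]
            if i < n and j < m:
                step = c_match if source[i] == noisy[j] else c_sub
                if here + step < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = here + step
                    move[i + 1][j + 1] = "match" if source[i] == noisy[j] else "sub"
            if i < n and here + c_del < cost[i + 1][j]:
                cost[i + 1][j] = here + c_del
                move[i + 1][j] = "del"
            if j < m and here + c_ins < cost[i][j + 1]:
                cost[i][j + 1] = here + c_ins
                move[i][j + 1] = "ins"
    ops, i, j = [], n, m          # trace the best path back to (0, 0)
    while (i, j) != (0, 0):
        op = move[i][j]
        ops.append(op)
        if op == "del":
            i -= 1
        elif op == "ins":
            j -= 1
        else:
            i, j = i - 1, j - 1
    return cost[n][m], ops[::-1]

best_cost, ops = align("kitten", "sitting")   # 2 substitutions, 1 insertion
```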
  3. Message passing II
    Imagine that N coins, with biases p_n, are tossed, once each; they are independent but have different biases. The outcome is a binary vector x. The outcomes (1/0) are summed, and the sum is c. What is the probability, given this sum, that coin n had outcome 1? Use a message-passing method to answer this question, and to generate efficiently a sample from the posterior distribution of x. [Hint: one option is to make a trellis in which the vertical coordinate is the partial sum after the nth outcome is added to the total; the horizontal coordinate runs over the n coins. Run a forward pass to find the probability of each conceivable sum c; run a backward pass from the particular value of c that occurred; then (just like the path-counting example) use the product of the forward and backward messages to pick a path through the trellis from the posterior distribution.]
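A sketch of the forward/backward passes over the partial-sum trellis described in the hint, checked against brute-force enumeration on a small example:

```python
from itertools import product

def posterior_given_sum(p, c):
    # P(x_n = 1 | sum of outcomes = c) by forward/backward message passing
    N = len(p)
    forward = [[1.0]]                 # forward[n][s] = P(first n coins sum to s)
    for pn in p:
        prev = forward[-1]
        cur = [0.0] * (len(prev) + 1)
        for s, q in enumerate(prev):
            cur[s] += q * (1 - pn)
            cur[s + 1] += q * pn
        forward.append(cur)
    backward = [[1.0]]                # backward[n][s] = P(coins n+1..N sum to s)
    for pn in reversed(p):
        prev = backward[-1]
        cur = [0.0] * (len(prev) + 1)
        for s, q in enumerate(prev):
            cur[s] += q * (1 - pn)
            cur[s + 1] += q * pn
        backward.append(cur)
    backward.reverse()
    Z = forward[N][c]                 # P(sum = c)
    post = []
    for n in range(N):
        tot = sum(forward[n][s] * p[n] * backward[n + 1][c - 1 - s]
                  for s in range(len(forward[n]))
                  if 0 <= c - 1 - s < len(backward[n + 1]))
        post.append(tot / Z)
    return post

p, c = [0.1, 0.4, 0.6, 0.9], 2
mp = posterior_given_sum(p, c)

# brute-force check over all 2^N outcomes
num, den = [0.0] * len(p), 0.0
for x in product((0, 1), repeat=len(p)):
    if sum(x) == c:
        w = 1.0
        for xn, pn in zip(x, p):
            w *= pn if xn else 1 - pn
        den += w
        for n, xn in enumerate(x):
            if xn:
                num[n] += w
exact = [v / den for v in num]
```

A useful sanity check: the posterior probabilities must sum to c, since exactly c coins came up heads.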
  4. Metropolis method Assume the data set {x} = { 1.1, 0.5, 1.8, 2.1, 0.3 } come independently from an exponential distribution P(x|lambda) = exp(-x/lambda)/Z(lambda). Use a Metropolis method to draw samples from the posterior distribution of lambda. (Pick an appropriate prior on lambda.)
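A minimal Metropolis sketch; the prior 1/lambda is an assumption made here (with that prior the posterior is inverse-gamma with mean S/(N-1) = 1.45, which the chain should reproduce):

```python
import math
import random

data = [1.1, 0.5, 1.8, 2.1, 0.3]
S, N = sum(data), len(data)

def log_post(lam):
    # exponential likelihood prod exp(-x/lam)/lam, with an assumed 1/lam prior
    return -math.inf if lam <= 0 else -(N + 1) * math.log(lam) - S / lam

random.seed(0)
lam, lp = 1.0, log_post(1.0)
samples = []
for t in range(60_000):
    prop = lam + random.gauss(0.0, 0.5)       # symmetric random-walk proposal
    lp_prop = log_post(prop)
    if random.random() < math.exp(min(0.0, lp_prop - lp)):
        lam, lp = prop, lp_prop
    if t >= 10_000:                            # discard burn-in
        samples.append(lam)

mean = sum(samples) / len(samples)             # should land near S/(N-1) = 1.45
```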
  5. Reversibility Show, by constructing a simple example, that Gibbs sampling (in which the K variables are updated in a fixed sequence) is not reversible, even though it is composed of a sequence of K transition operators that are reversible.
  6. Change point inference. Make some data that are a noisy version of a simple function with a discontinuity.
     eg, y = (t < t0) ? y0 : y1
    Add Gaussian noise to y to get N data points {d}. Now imagine that you do not know the time t0 of the discontinuity. Infer it.
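A sketch of the grid-posterior approach, assuming (for simplicity) that y0, y1 and the noise level are known and only t0 is inferred:

```python
import math
import random

random.seed(0)
y0, y1, t0, sigma, N = 0.0, 2.0, 30, 0.4, 60
d = [(y0 if t < t0 else y1) + random.gauss(0, sigma) for t in range(N)]

def loglik(t):
    # log-likelihood of the data if the jump happens at time t
    return sum(-(x - (y0 if i < t else y1)) ** 2 / (2 * sigma ** 2)
               for i, x in enumerate(d))

logp = [loglik(t) for t in range(1, N)]        # candidate change points 1..N-1
m = max(logp)
post = [math.exp(l - m) for l in logp]         # uniform prior over t0
Z = sum(post)
post = [q / Z for q in post]
t_map = 1 + max(range(len(post)), key=post.__getitem__)
```

With this signal-to-noise ratio the posterior is sharply peaked at the true change point.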
  7. Hidden Markov Model. Implement an algorithm that, given parameters pi, A, B of a hidden Markov model, and given a data sequence x, will
    1. find the posterior probability of each state at each time [using sum-product algorithm]
    2. find the most probable state-sequence [max-product algorithm, aka min-sum, aka Viterbi]
    3. draw a sample from the posterior distribution of the hidden sequence [using information from the sum-product algorithm].
    Construct a simple example that illustrates why the sequence found by the min-sum (Viterbi) algorithm may be a poor way of representing the posterior distribution.
  8. Life in high dimensions. Consider a hypersphere of radius 1 and a hypercube of radius r (i.e., width of cube is 2r, and diameter of sphere is 2), both in K dimensions. Set K to 256. Use simple Monte Carlo methods to find
    1. The fraction of the hypersphere that is enclosed inside the hypercube as a function of r, for r from 0 to 1. How small can r go, and still give 95% of the hypersphere inside the cube?
    2. The fraction of the hypercube that is enclosed in the hypersphere as a function of r, for r from 0 to 1. When the cube contains 95% of the hypersphere, what fraction of the cube is inside the hypersphere?

    (Exercise by John Platt; for further reading see his 2004 paper on bit-indexing.)
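A simple Monte Carlo sketch for K = 256 (sphere points are normalized Gaussian vectors); it already exhibits the high-dimensional surprise that a cube of half-width 0.3 captures essentially all of the sphere, while essentially none of that cube lies inside the sphere:

```python
import math
import random

random.seed(0)
K, T = 256, 2000

def sphere_point():
    # uniform point on the unit sphere: normalize a Gaussian vector
    g = [random.gauss(0, 1) for _ in range(K)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def frac_sphere_in_cube(r):
    return sum(max(map(abs, sphere_point())) <= r for _ in range(T)) / T

def frac_cube_in_sphere(r):
    inside = 0
    for _ in range(T):
        p = [random.uniform(-r, r) for _ in range(K)]
        inside += sum(x * x for x in p) <= 1.0
    return inside / T

f_sphere = frac_sphere_in_cube(0.3)
f_cube = frac_cube_in_sphere(0.3)
```

Sweeping r over a grid then answers both sub-questions.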

Decision theory

  1. In financial mathematics, it is often assumed that a stock price will vary in accordance with a random walk with a drift parameter mu and a volatility parameter sigma.
    d log S = mu dt + sigma dz
    where dz is a Wiener process with mean zero and variance dt. An option is a financial instrument giving the owner the right (but not the obligation) to buy or sell something. For example, a gambler interested in owning a stock in case its value goes up a lot might buy an option to buy the stock at some future time $T$ for a fixed price $K$ (known as the strike price). If the price is indeed above $K$ at the expiration time, the gambler can exercise the option and turn an instant profit of $S_T - K$ by immediately selling the stock. The gambler has to pay an appropriate price for this option. What should the price of an option be? A `European' option can be exercised only at its expiration time $T$; an `American' option may also be exercised at any time prior to $T$. Chop time up into $N$ steps of duration $T/N$ and use a discrete-time message-passing algorithm to work out the correct price for an option of each type, as a function of the strike price $K$ and the other parameters. Assume the gamblers who price options wish to maximize their expected return (not the log of their return) and for simplicity assume that the interest rate for cash is zero. (a) Model the variation in the log of the stock price by a random walk with discrete steps up or down by $log f$, with probabilities $p_{+}$ and $p_{-}$ respectively. [For thoroughness, if you want, show that $log f$ and $p_{+} = 1/2 + epsilon$ are given in terms of the other parameters and $alpha = mu*mu*T/(N*sigma^2)$ by
    	epsilon = 0.5 / sqrt( 1.0 + 1.0/alpha )
    	log f = sigma^2/mu * alpha / (2.0 * epsilon) ]
    Work out the expected value of the European option by propagating backward from the leaves of the tree of possible prices. (b) For the American option, your message-passing algorithm should keep track, for every node in the tree, of 1) the payoff if the option is exercised at that time; 2) the expected payoff if the option is not exercised at that time, and at subsequent times the optimal decisions are made. 3) the value of the option (which is the max of those two quantities).
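A sketch of part (a)'s tree with the backward pass of part (b); the parameter values are illustrative. Note that under the expected-return criterion with upward drift and zero interest, early exercise of a call never helps, so the two prices coincide here:

```python
import math

def option_value(S0, K, mu, sigma, T, N, american):
    # discrete random walk on log S with step log f, up-probability 1/2 + eps
    alpha = mu * mu * T / (N * sigma ** 2)
    eps = 0.5 / math.sqrt(1.0 + 1.0 / alpha)
    logf = (sigma ** 2 / mu) * alpha / (2.0 * eps)
    p = 0.5 + eps
    # V[k] = value at the node reached by k up-moves; start at expiry
    V = [max(S0 * math.exp((2 * k - N) * logf) - K, 0.0) for k in range(N + 1)]
    for step in range(N - 1, -1, -1):           # backward pass through the tree
        V = [p * V[k + 1] + (1 - p) * V[k] for k in range(step + 1)]
        if american:                             # option to exercise right now
            V = [max(S0 * math.exp((2 * k - step) * logf) - K, hold)
                 for k, hold in enumerate(V)]
    return V[0]

euro = option_value(10.0, 10.0, 0.1, 0.3, 1.0, 50, american=False)
amer = option_value(10.0, 10.0, 0.1, 0.3, 1.0, 50, american=True)
euro_high_K = option_value(10.0, 12.0, 0.1, 0.3, 1.0, 50, american=False)
```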
  2. St Petersburg Envelope paradox
    You are presented with two envelopes A and B, and told that one contains twice as much money as the other. You are given envelope A and allowed to see how much is in it, and offered the options of keeping envelope A or switching to B. What should you do?
    Amos Storkey's statement of the question explains why this is called a paradox:
    You are taking part in a game show. The host introduces you to two envelopes. He explains carefully that you will get to choose one of the envelopes, and keep the money that it contains. He makes sure you understand that each envelope contains a cheque for a different sum of money, and that in fact, one contains twice as much as the other. The only problem is that you don't know which is which.
    The host offers both envelopes to you, and you may choose which one you want. There is no way of knowing which has the larger sum in, and so you pick an envelope at random (equiprobably). The host asks you to open the envelope. Nervously you reveal the contents to contain a cheque for 40,000 pounds.
    The host then says you have a chance to change your mind. You may choose the other envelope if you would rather. You are an astute person, and so do a quick sum. There are two envelopes, and either could contain the larger amount. As you chose the envelope entirely at random, there is a probability of 0.5 that the larger cheque is the one you opened. Hence there is a probability 0.5 that the other is larger. Aha, you say. You need to calculate the expected gain due to swapping. Well the other envelope contains either 20,000 pounds or 80,000 pounds equiprobably. Hence the expected gain is 0.5x20000+0.5x80000-40000, ie the expected amount in the other envelope minus what you already have. The expected gain is therefore 10,000 pounds. So you swap.
    Does that seem reasonable? Well maybe it does. If so consider this. It doesn't matter what the money is, the outcome is the same if you follow the same line of reasoning. Suppose you opened the envelope and found N pounds in the envelope, then you would calculate your expected gain from swapping to be 0.5(N/2)+0.5(2N)-N = N/4, and as this is greater than zero, you would swap.
    But if it doesn't matter what N actually is, then you don't actually need to open the envelope at all. Whatever is in the envelope you would choose to swap. But if you don't open the envelope then it is no different from choosing the other envelope in the first place. Having swapped envelopes you can do the same calculation again and again, swapping envelopes back and forward ad infinitum. And that is absurd.
    That is the paradox. A simple mathematical puzzle. The question is: What is wrong? Where does the fallacy lie, and what is the problem?
    Amos's solution. Another solution, Further discussion.
  3. Using information content to get insight into algorithm complexity.
    Consider an algorithm that sorts a bunch of objects by calling a subroutine that compares two of them for you, returning either "+" or "-". For simplicity, assume there are no ties. How many times must an algorithm call the binary comparison operation to be able to guarantee that it will correctly sort the list?
    [Hint: Think of the initial state as being a permutation, which we are conveying to someone. How much information is there in a permutation?]
  4. Confidence-based assessment. Consider testing students using multiple-choice questions, or questions with yes/no answers. In a confidence-based assessment, a student's rating of his confidence in an answer is taken into account in the marking scheme, so as to reward students for accurately acknowledging their uncertainty. In the LAPT scheme, for example, you earn 1, 2, or 3 marks for a correct answer depending on whether you gave your confidence level as `1', `2', or `3'; incorrect answers receive 0, -2, and -6 marks respectively. Comment on this scheme. How should a student respond in terms of this `confidence' value, if he thinks there is a probability p that his answer is correct? How would you design a confidence-based assessment system?
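The expected-mark calculation for the LAPT-style scheme is a few lines; maximizing it gives switch-over points at p = 2/3 and p = 4/5:

```python
def expected_mark(p, c):
    # expected mark for confidence c in {1, 2, 3} when P(correct) = p
    reward = {1: 1, 2: 2, 3: 3}
    penalty = {1: 0, 2: -2, 3: -6}
    return p * reward[c] + (1 - p) * penalty[c]

def best_choice(p):
    return max((1, 2, 3), key=lambda c: expected_mark(p, c))

# solving p = 4p - 2 and 4p - 2 = 9p - 6 gives thresholds 2/3 and 4/5
choices = [best_choice(p) for p in (0.60, 0.70, 0.90)]
```

So an expectation-maximizing student declares confidence 1 below p = 2/3, confidence 2 between 2/3 and 4/5, and confidence 3 above that.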


(this question also belongs in the chapter about labelling dice with non-standard values, in the Data Compression / Probability area.)
  1. Is it possible to label the faces of two six-sided dice in such a way that the probability distribution of the sum is identical to the distribution of the sum that holds for two regular dice labelled 1..6? (Alan Beardon, 2005). Hint: this is one of the rare puzzles for which probability generating functions are useful. Hint2: are there any solutions apart from the trivial solutions in which for example one die is labelled 0..5 and the other 2..7? Hint 3: this problem is equally easy for the case of 4-sided dice.
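Generating functions make the search mechanical: the distribution of the sum of two dice is the product of their face polynomials, so one can enumerate candidate first dice and test whether the target polynomial divides exactly (a brute-force sketch, restricting labels to positive integers):

```python
from itertools import combinations_with_replacement

def gf(die):
    # face polynomial: coefficient of x^v counts faces labelled v
    coeffs = [0] * (max(die) + 1)
    for face in die:
        coeffs[face] += 1
    return coeffs

def polymul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def polydiv(num, den):
    # exact integer polynomial division; None if it does not divide cleanly
    num = list(num)
    q = [0] * (len(num) - len(den) + 1)
    for i in reversed(range(len(q))):
        top = num[i + len(den) - 1]
        if top % den[-1]:
            return None
        q[i] = top // den[-1]
        for j, dj in enumerate(den):
            num[i + j] -= q[i] * dj
    return q if not any(num) and min(q) >= 0 else None

std = (1, 2, 3, 4, 5, 6)
target = polymul(gf(std), gf(std))      # sum distribution of two ordinary dice

solutions = set()
for die1 in combinations_with_replacement(range(1, 12), 6):
    q = polydiv(target, gf(die1))
    if q is None or q[0] != 0 or sum(q) != 6:
        continue                         # need a genuine 6-face positive die
    die2 = tuple(f for f, cnt in enumerate(q) for _ in range(cnt))
    solutions.add(tuple(sorted([die1, die2])))
```

The search returns exactly two solutions: the standard labelling, and Sicherman's pair (1,2,2,3,3,4) with (1,3,4,5,6,8).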

Neural networks

  1. Another derivation of the Hopfield network
    Think of associative memory as an inference problem. Imagine that you know that a set of N memories {x} (with elements in {+/-1}) has given rise to `data' in the form of the Hebbian weight matrix W = sum x x^T. Imagine furthermore that a noisy version y of one of the memories x0 is provided. Your task now is to infer x0 given the available information, W, y, and N.
    Some approximations are required in order to get a simple answer out. Write down the posterior of x0. The contribution from y to the likelihood of x0 is a simple bias. The central quantity to think about is the other likelihood factor, P(W|x0). Approximate this quantity by assuming that, conditional on x0, W_ij is Gaussian-distributed with mean
    mu=x0_i x0_j
    and variance (N-1) (treating the contributions from the (N-1) other memories as independent). Approximate the likelihood factor P(W|x0) by
    prod_{ij} P(W_{ij}|x0).
    [An approximation since in fact the other patterns {x} introduce weak dependencies among the W_{ij}.]
    Now write down dynamics for each variable x^0_i that greedily maximize the posterior probability P(x0|W,y). What have you got?
    Thanks to Máté Lengyel for suggesting this exercise.
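Dropping the small bias term contributed by y, the greedy dynamics reduce to the familiar Hopfield sign updates; a small simulation (with toy sizes assumed here) shows them cleaning up a corrupted memory:

```python
import random

random.seed(0)
N, M = 100, 3                       # 100 units, 3 stored patterns (assumed sizes)
patterns = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(M)]

# Hebbian weights W = sum over patterns of x x^T, with zero diagonal
W = [[sum(x[i] * x[j] for x in patterns) if i != j else 0 for j in range(N)]
     for i in range(N)]

x0 = patterns[0]
y = list(x0)
for i in random.sample(range(N), 10):    # flip 10 bits to make a noisy cue
    y[i] = -y[i]

state = list(y)
for _ in range(10):                       # repeated greedy sign updates
    for i in range(N):
        a = sum(W[i][j] * state[j] for j in range(N))
        state[i] = 1 if a > 0 else -1 if a < 0 else state[i]
```

With this load (3 patterns in 100 units) the dynamics settle back onto the stored memory x0.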

Lyapunov functions

  1. Once you have understood `Southeast', you may be able to tackle Conway's Soldiers. See the problem definition at mathworld; for a nice applet that can (with one click) embody Conway's soldiers go here.

    The exercise is this: find a Lyapunov function that can be used to prove that it's impossible to hop a solitaire peg a distance greater than 4 from a wall of pegs.

    Solution can be found at Plus, though their image explaining the Lyapunov function is off by one row.

Site last modified Sun Aug 31 18:51:08 BST 2014