Information and Communication


General Procedure

In the second half of this course, you will pick an information-theoretic subject from the list below and study that. You are allowed to work alone or in groups of 2. Every group has to prepare and give a 45-minute talk followed by 15 minutes of questions about the chosen topic. Also, every group prepares a written report about the chosen topic (to be handed in by Sunday, 8 February 2015, 23:59).

Detailed Procedure

Presentation

A presentation should make your fellow students care about the subject and convey the main ideas of the topic/paper. Try to show how your topic connects to something we have learned in class. You should present the main result(s) with some precision, and it would be nice to give some key ingredients of proofs (where appropriate - some proofs are nasty and/or not interesting). Do not spend all your time putting up formula after formula: prepare a presentation you would like to listen to. It is important to stay on time; you leave a bad impression if your 45-minute talk lasts for 50 minutes or more. Here are a few general tips:

Report

The report consists of at most 10 A4 pages giving an overview of the selected topic. A proper report contains at least the title of the chosen subject, an abstract, a general description of the subject, a high-level overview of the main ideas, the interesting low-level details, and references to scientific articles and other resources you have used (please insert hyperlinks for every reference). Make the report more attractive by including one or more pictures or figures. As a service, you get the opportunity to submit one draft version of your report at any time, and you will get some feedback from Chris within 2 working days (Monday to Friday) after submission. You have to hand in the final version of your report by Sunday, 8 February 2015, 23:59.

Grades

Your grade for the final presentation will be determined by the quality of the presentation and your ability to answer questions about the subject (we will use this list for the evaluation).

The final presentation counts for 1/3 of your final grade for the course, 1/3 will be determined by the report, and 1/3 will be determined by the average of the 3 homework exercises.

Suggested topics for final presentation:

Note that you are very welcome to suggest a topic yourself. In that case, send me (at least) one link to an academic paper about the subject. I will let you know quickly if that subject is appropriate. You do not have to stick to the indicated sources. On the contrary, you are encouraged to find other articles or sources that are relevant for the selected topic.

Some subjects are far more challenging than others. A good talk and report about a challenging subject is more impressive than a good talk/report about an easy subject, and the grade will reflect that.

Some topics are copied with kind permission from Mathias Madsen's course on information theory.

Asymptotic equipartition property

In information theory, the analog of the law of large numbers is the asymptotic equipartition property (AEP). It is a direct consequence of the (weak) law of large numbers. The AEP states that \(\frac{1}{n}\log \frac{1}{P_{X_1,\ldots,X_n}(X_1,\ldots,X_n)}\) is close to the entropy \(H(X_i)\), where \(X_1,\ldots,X_n\) are iid (independently and identically distributed) random variables. Using the AEP, you can easily prove Shannon's source-coding theorem, both for lossless and lossy compression. An extension of the AEP is also the crucial ingredient in the proof of Shannon's noisy-channel coding theorem.

Study and present the material in Chapter 3 of [CT].
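
If you like, you can convince yourself of the AEP numerically. The following minimal sketch (not part of the official course material) draws iid Bernoulli(p) bits and compares \(-\frac{1}{n}\log_2 P(X_1,\ldots,X_n)\) with the entropy \(H(p)\):

    # Empirically checking the AEP for iid Bernoulli(p) bits: the sample value
    # of -(1/n) * log2 P(X_1,...,X_n) should concentrate around H(p) as n grows.
    import math
    import random

    def entropy_bernoulli(p):
        """Binary entropy H(p) in bits."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def empirical_aep(p, n, trials=3):
        h = entropy_bernoulli(p)
        for _ in range(trials):
            xs = [1 if random.random() < p else 0 for _ in range(n)]
            k = sum(xs)  # number of ones
            # log2 P(x_1,...,x_n) = k*log2(p) + (n-k)*log2(1-p)
            log_prob = k * math.log2(p) + (n - k) * math.log2(1 - p)
            print(f"n={n}: -(1/n) log2 P = {-log_prob / n:.4f}   H(p) = {h:.4f}")

    random.seed(0)
    for n in (10, 100, 10000):
        empirical_aep(0.1, n)

For small \(n\) the values fluctuate; for large \(n\) they concentrate sharply around \(H(p)\), which is exactly what the AEP asserts.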

Data Compression in practice: Arithmetic Codes

We have seen that Huffman codes are optimal symbol codes for data compression. However, they also have disadvantages, for instance when the source statistics change over time. Arithmetic codes are better suited to this case. Arithmetic codes are described in Section 5.5 of [CF] and Section 6.2 of [MacKay]. An interesting application is the Dasher tool by MacKay's group. Describe in detail what the disadvantages of Huffman codes are and how other codes (such as arithmetic codes) get around these problems.
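
To get a feeling for the mechanism, here is a minimal floating-point sketch of the interval-narrowing idea behind arithmetic coding. Real implementations, such as the one in Section 6.2 of [MacKay], use integer arithmetic and adaptive probability models; the source model probs below is just an assumed example.

    # Interval narrowing: each symbol shrinks [low, high) to the sub-interval
    # assigned to it; any number in the final interval identifies the sequence.
    from math import ceil, log2

    def arithmetic_encode(symbols, probs):
        """Return (low, high): the final subinterval of [0,1) for the sequence."""
        cum = {}                      # symbol s occupies [cum[s], cum[s] + probs[s])
        c = 0.0
        for s, p in probs.items():
            cum[s] = c
            c += p
        low, high = 0.0, 1.0
        for s in symbols:
            width = high - low
            high = low + width * (cum[s] + probs[s])
            low = low + width * cum[s]
        return low, high

    probs = {'a': 0.7, 'b': 0.2, 'c': 0.1}   # assumed source model
    low, high = arithmetic_encode("aababa", probs)
    # About ceil(log2(1/(high-low))) + 1 bits suffice to name a number in [low, high).
    print(low, high, ceil(log2(1 / (high - low))) + 1, "bits")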

Data Compression in practice: Lempel-Ziv coding

Section 6.4 of [MacKay] describes the Lempel-Ziv algorithms, which are widely used for data compression (e.g. in the gzip command). They follow a different philosophy from the codes we have seen. Understand how they work and try to analyze them in detail. You can also try to get your hands on the source code of other modern-day compression algorithms and describe what kind of techniques they use and how well they work.
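
As a warm-up, the parsing step of LZ78 (one member of the Lempel-Ziv family; gzip is actually based on the LZ77 variant combined with Huffman coding) fits in a few lines:

    # A minimal sketch of LZ78 parsing: the text is cut into phrases, each of
    # which is a previously seen phrase plus one new character.
    def lz78_parse(text):
        """Return a list of (dictionary index, next character) pairs."""
        dictionary = {"": 0}        # phrase -> index; index 0 is the empty phrase
        phrases = []
        current = ""
        for ch in text:
            if current + ch in dictionary:
                current += ch       # keep extending the longest known phrase
            else:
                phrases.append((dictionary[current], ch))
                dictionary[current + ch] = len(dictionary)
                current = ""
        if current:                 # flush a trailing phrase, if any
            phrases.append((dictionary[current[:-1]], current[-1]))
        return phrases

    print(lz78_parse("abababababaabab"))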

Noisy-channel coding: Zero-error case

In the problem of noisy-channel coding, we study the question of how much information can be transmitted over a noisy channel. A channel can be described by a conditional probability distribution \(P_{Y|X}\) which, for every possible channel input \(x\), specifies a probability distribution \(P_{Y|X=x}\) over the channel output \(Y\). We are interested in determining the maximum number of distinguishable signals for \(n\) uses of a communication channel.

In the zero-error case, the receiver has to be able to identify the message \(x\) with certainty from his output \(y\). In this setting, we are only interested in whether two input symbols \(x \neq x'\) can possibly be confused by the receiver or not. Hence, the zero-error capacity of a channel is a property of the confusability graph corresponding to the channel. Studying this problem leads to very interesting connections with graph theory.
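
As an illustration, here is a small sketch that builds the confusability graph of Shannon's "pentagon" channel (5 inputs, each received either unchanged or shifted by one) and finds the largest set of inputs that cannot be confused in a single channel use:

    # Two inputs are confusable iff some channel output can arise from both.
    # The largest set of pairwise non-confusable inputs gives the number of
    # messages that can be sent with zero error in one channel use.
    from itertools import combinations

    def confusable(x, xp, outputs):
        return bool(outputs[x] & outputs[xp])

    # outputs[i] = set of possible channel outputs for input i (assumed model)
    outputs = {i: {i, (i + 1) % 5} for i in range(5)}

    edges = [(x, xp) for x, xp in combinations(range(5), 2)
             if confusable(x, xp, outputs)]
    print("confusability graph edges:", edges)

    # brute-force largest independent set of the confusability graph
    best = max((s for r in range(6) for s in combinations(range(5), r)
                if not any(confusable(x, xp, outputs) for x, xp in combinations(s, 2))),
               key=len)
    print("one largest non-confusable input set:", best)   # size 2 for one use

Famously, using this channel twice allows 5 perfectly distinguishable input pairs rather than \(2\times 2 = 4\), and determining its exact zero-error capacity remained open until Lovász settled it in 1979.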

Sources you can study for this topic (besides the Wikipedia pages above):

Noisy-channel coding: Asymptotic case

In the (more common, and more realistic) asymptotic setting of noisy-channel coding, the capacity of a channel described by \(P_{Y|X}\) is defined to be \(C := \max_{P_X} I(X;Y)\). Shannon's noisy-channel coding theorem states that the capacity is an exact characterization of how much information can be transmitted over the channel in the asymptotic limit of many independent uses of the channel. Any code whose rate is above capacity is bound to make a large error, and there exist codes with rates below capacity that make arbitrarily small errors.

It is quite challenging to prove this theorem and its converse, but it is so fundamentally important that you will want to have done it once. It is very well described in Chapter 7 of [CT] and Chapter 10 of [MacKay].
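
While the theorem itself is asymptotic, the capacity formula is easy to evaluate numerically. The sketch below (an illustration, not course material) maximizes \(I(X;Y)\) over the input distribution of a binary symmetric channel and compares the result with the closed-form capacity \(1 - H_2(f)\):

    # For a BSC with flip probability f: I(X;Y) = H(Y) - H(Y|X) = H2(P(Y=1)) - H2(f),
    # maximized by the uniform input distribution, giving C = 1 - H2(f).
    import math

    def h2(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def mutual_information(px, f):
        """I(X;Y) for a BSC(f) with P(X=1) = px."""
        py1 = px * (1 - f) + (1 - px) * f      # P(Y=1)
        return h2(py1) - h2(f)

    f = 0.11
    grid = [i / 1000 for i in range(1001)]
    best_px = max(grid, key=lambda px: mutual_information(px, f))
    print("argmax P(X=1) ~", best_px)
    print("capacity (grid) ~", mutual_information(best_px, f))
    print("capacity 1 - H2(f) =", 1 - h2(f))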

Hamming Code, an example of Error Correcting Codes

If you send messages through a binary channel with a fixed and symmetric probability of bit flips (i.e., 0 being received as 1 or vice versa), then your task as an encoder is essentially to choose codewords that are as far apart as possible in terms of Hamming distance (the number of positions in which two words differ). A method for creating such codebooks was invented by Hamming in the 1940s. It turns out that his encoding method can be expressed as a certain kind of matrix multiplication, and the choice of encoding scheme for the binary symmetric channel thus corresponds to a choice of matrix. The reverse inference problem then similarly becomes a linear-algebra problem. Check Chapter 1 of [MacKay]. Things you can do under this topic:
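
For instance, as a warm-up you could reproduce the matrix-multiplication view in code. The sketch below uses one standard choice of generator matrix for the (7,4) code (not necessarily the exact matrix used in [MacKay]) and checks that the minimum distance is 3, so any single bit flip can be corrected:

    # Encoding is a matrix multiplication over GF(2): codeword = message * G mod 2.
    import itertools

    G = [
        [1, 0, 0, 0, 1, 1, 0],
        [0, 1, 0, 0, 1, 0, 1],
        [0, 0, 1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1, 1, 1],
    ]

    def encode(msg):
        return [sum(m * g for m, g in zip(msg, col)) % 2 for col in zip(*G)]

    def hamming_distance(a, b):
        return sum(x != y for x, y in zip(a, b))

    codewords = [encode(list(m)) for m in itertools.product([0, 1], repeat=4)]
    dmin = min(hamming_distance(a, b)
               for a, b in itertools.combinations(codewords, 2))
    print("minimum distance:", dmin)   # 3, so any single bit flip is correctable

    # nearest-codeword decoding of a corrupted word
    received = encode([1, 0, 1, 1])
    received[2] ^= 1                   # flip one bit
    decoded = min(codewords, key=lambda c: hamming_distance(c, received))
    print("decoded correctly:", decoded == encode([1, 0, 1, 1]))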

Hash Codes: Codes for Efficient Information Retrieval

While Shannon's two main theorems are about data compression and noisy-channel coding, a practically very relevant task is information retrieval. A simple example of an information-retrieval problem is the task of implementing a phone directory service, which, in response to a person's name, returns (1) a confirmation that that person is listed in the directory; and (2) the person's phone number and other details. This is a classical computer-science problem called the dictionary problem: the task of designing a data structure that maintains a set of data during 'search', 'delete' and 'insert' operations. A standard solution to the dictionary problem is a hash table. Learn all about them and their relation to information theory in Chapter 12 of [MacKay].
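
To make the mechanism explicit (Python's built-in dict is itself a highly optimized hash table), here is a minimal sketch of a phone directory backed by a hash table with chaining; the bucket count and the sample entries are arbitrary:

    # A hash function maps a name to a bucket index; colliding names share a
    # bucket and are resolved by a short linear scan ("chaining").
    class PhoneDirectory:
        def __init__(self, n_buckets=8):
            self.buckets = [[] for _ in range(n_buckets)]

        def _bucket(self, name):
            return self.buckets[hash(name) % len(self.buckets)]

        def insert(self, name, number):
            bucket = self._bucket(name)
            for i, (k, _) in enumerate(bucket):
                if k == name:
                    bucket[i] = (name, number)   # overwrite existing entry
                    return
            bucket.append((name, number))

        def search(self, name):
            for k, v in self._bucket(name):
                if k == name:
                    return v
            return None                          # not listed

        def delete(self, name):
            bucket = self._bucket(name)
            bucket[:] = [(k, v) for k, v in bucket if k != name]

    d = PhoneDirectory()
    d.insert("Alice", "020-555-0101")
    print(d.search("Alice"), d.search("Bob"))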

Kolmogorov Complexity

The great mathematician Kolmogorov culminated a lifetime of research in mathematics, complexity, and information theory with his definition in 1965 of the intrinsic descriptive complexity of an object. In the treatment in the course, the object \(X\) has been a random variable drawn according to a probability mass function \(P_X(x)\). If \(X\) is random, there is a sense in which the descriptive complexity of the event \(X = x\) is \(\log \frac{1}{P_X(x)}\) , because \(\lceil{\log \frac{1}{P_X(x)}}\rceil\) is the number of bits required to describe \(x\) by an optimal code. One notes immediately that the descriptive complexity of such an object depends on the probability distribution.

Kolmogorov went further. He defined the algorithmic (descriptive) complexity of an object to be the length of the shortest binary computer program that describes the object. (Apparently, a computer, the most general form of data decompressor, will, after a finite amount of computation, use this description to exhibit the object described.) Thus, the Kolmogorov complexity of an object dispenses with the probability distribution. Kolmogorov made the crucial observation that the definition of complexity is essentially computer independent. It is an amazing fact that the expected length of the shortest binary computer description of a random variable is approximately equal to its entropy. Thus, the shortest computer description acts as a universal code which is uniformly good for all probability distributions. In this sense, algorithmic complexity is a conceptual precursor to entropy.

Learn more about the topic in Chapter 14 of [CT].
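
Kolmogorov complexity itself is uncomputable, but you can illustrate the idea with any fixed compressor: the length of the compressed output is an upper bound on the description length of the data under that particular decompressor, and structured strings need far shorter descriptions than random ones. A minimal sketch using zlib:

    # NOT a computation of Kolmogorov complexity, just an upper bound under one
    # fixed decompressor (zlib): regular data compresses well, random data does not.
    import os
    import zlib

    def description_length(data):
        """Length in bytes of a zlib-compressed description of `data`."""
        return len(zlib.compress(data, 9))

    n = 100_000
    structured = b"ab" * (n // 2)     # highly regular: a tiny program describes it
    random_bytes = os.urandom(n)      # incompressible with high probability
    print("structured:", description_length(structured), "bytes")
    print("random:    ", description_length(random_bytes), "bytes")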

Stopping Rules and the Secretary Problem

A stopping rule is a rule for deciding when to stop doing something, based on a finite amount of past data - for instance, when to leave the casino based on your past gambling record, or when to stop a series of clinical trials based on the results so far. In the secretary problem, you are confronted with a finite stopping problem: candidates arrive one by one in random order, you must accept or reject each candidate on the spot, and the goal is to maximize the probability of stopping at the single best one.
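
The classical result is that observing roughly a \(1/e\) fraction of the candidates and then taking the first one who beats them all succeeds with probability about \(1/e \approx 0.37\). A minimal simulation sketch (parameters chosen arbitrarily):

    # Simulate the "observe the first n/e candidates, then take the first one
    # better than all of them" stopping rule and estimate its success probability.
    import math
    import random

    def secretary_trial(n, rule_fraction=1 / math.e):
        ranks = list(range(n))          # 0 = worst candidate, n-1 = best
        random.shuffle(ranks)           # candidates arrive in random order
        k = int(n * rule_fraction)      # size of the observation phase
        threshold = max(ranks[:k], default=-1)
        for candidate in ranks[k:]:
            if candidate > threshold:   # first candidate beating everyone seen so far
                return candidate == n - 1
        return ranks[-1] == n - 1       # forced to take the last one

    random.seed(1)
    trials = 20000
    wins = sum(secretary_trial(100) for _ in range(trials))
    print("empirical success probability:", wins / trials, "  1/e =", 1 / math.e)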

Some things you might want to do with this topic are:

Markov Processes: Entropy Rates of a Stochastic Process

In the case of iid random variables, the asymptotic equipartition property (AEP) establishes that \(n H(X)\) bits suffice on average to describe \(n\) iid random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? In Chapter 4 of [CT], such stochastic processes are defined and studied. It is shown that, just as in the iid case, the entropy \(H(X_1,X_2, \ldots, X_n)\) grows (asymptotically) linearly in \(n\) at rate \(H(\mathcal{X})\), the entropy rate of the process.
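
For a stationary Markov chain the entropy rate has a simple closed form, \(H(\mathcal{X}) = \sum_i \mu_i H(P_{i\cdot})\), where \(\mu\) is the stationary distribution and \(P_{i\cdot}\) is row \(i\) of the transition matrix (see [CT] Chapter 4). A minimal sketch for a two-state chain (transition probabilities chosen arbitrarily):

    # Entropy rate of a two-state stationary Markov chain, compared with the
    # entropy of its stationary distribution (the iid benchmark).
    import math

    def h(dist):
        return -sum(p * math.log2(p) for p in dist if p > 0)

    # transition matrix P[i][j] = Pr[next state = j | current state = i]
    P = [[0.9, 0.1],
         [0.4, 0.6]]

    # stationary distribution of a 2-state chain: mu solves mu = mu P
    a, b = P[0][1], P[1][0]
    mu = [b / (a + b), a / (a + b)]

    entropy_rate = sum(mu[i] * h(P[i]) for i in range(2))
    print("stationary distribution:", mu)
    print("entropy rate:", entropy_rate, "bits per symbol")
    print("iid entropy of the stationary distribution:", h(mu), "bits")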

Gambling and Data Compression

At first sight, information theory and gambling seem to be unrelated. But as you can read in Chapter 6 of [CT], there is a strong duality between the growth rate of investment in a horse race and the entropy rate of the horse race. Shannon himself exploited his knowledge to actually win money by gambling. Indeed, one can show that the sum of the growth rate and the entropy rate is a constant. In the process of proving this, it is argued that the financial value of side information is equal to the mutual information between the horse race and the side information. In order to work on this topic, you will have to understand parts of the previous topic (Chapter 4 of [CT]) as well.
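
As a sanity check on this duality, the following sketch simulates proportional (Kelly) betting on a horse race with fair \(m\)-for-1 odds and compares the empirical growth rate with \(\log_2 m - H(p)\); the win probabilities below are just an example:

    # With m horses at fair odds m-for-1 and proportional betting b = p, the
    # doubling rate is W = log2(m) - H(p), so W + H(p) = log2(m).
    import math
    import random

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def simulate_growth(p, n_races=100000, seed=0):
        """Average log2 wealth growth per race under proportional betting."""
        rng = random.Random(seed)
        m = len(p)
        log_wealth = 0.0
        for _ in range(n_races):
            winner = rng.choices(range(m), weights=p)[0]
            # we bet fraction p[i] on horse i; the winner pays odds of m-for-1,
            # so wealth is multiplied by p[winner] * m this race
            log_wealth += math.log2(p[winner] * m)
        return log_wealth / n_races

    p = [0.5, 0.25, 0.125, 0.125]            # win probabilities of 4 horses
    print("empirical growth rate:", simulate_growth(p))
    print("log2(m) - H(p)      :", math.log2(len(p)) - entropy(p))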

Codebreaking for Traditional Cipher Systems

Information theory provides general insights about the security of various encryption schemes. However, cracking an "insecure" cipher often requires a lot of clever probabilistic inference, for instance in the form of sampling schemes like Gibbs sampling of the posterior distribution over encryption keys.

Historically, cracking the German Enigma encryption machine was such a task; you can read about it in Section 18.4 of [MacKay]. More recently, Kevin Knight and colleagues have also written a number of papers on codebreaking from an explicitly Bayesian viewpoint (e.g., this one).

If you like a programming challenge, you can also try to decipher the encrypted texts listed under this topic on Mathias' page and report on your findings.
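
If you want a feel for what such an attack looks like, here is a toy sketch of Metropolis sampling over substitution-cipher keys (a close cousin of the Gibbs sampling mentioned above), scoring candidate decryptions with a bigram log-likelihood. The reference corpus and the ciphertext below are placeholders, so this particular snippet will not crack anything; a real attack needs bigram statistics from a large corpus:

    # Propose swapping two entries of the key; accept if the decryption's bigram
    # log-likelihood improves, or with probability exp(delta) otherwise.
    import math
    import random
    import string
    from collections import Counter

    ALPHABET = string.ascii_lowercase

    def bigram_logprobs(corpus):
        counts = Counter(zip(corpus, corpus[1:]))
        total = sum(counts.values())
        # add-one smoothing so unseen bigrams get small but nonzero probability
        return {bg: math.log((counts.get(bg, 0) + 1) / (total + 26 * 26))
                for bg in ((a, b) for a in ALPHABET for b in ALPHABET)}

    def score(text, logp):
        return sum(logp[bg] for bg in zip(text, text[1:]))

    def decrypt(ciphertext, key):
        return "".join(key[c] for c in ciphertext)   # key: cipher letter -> plain letter

    def metropolis(ciphertext, logp, iters=20000, seed=0):
        rng = random.Random(seed)
        key = dict(zip(ALPHABET, ALPHABET))
        best_key, best = dict(key), score(decrypt(ciphertext, key), logp)
        current = best
        for _ in range(iters):
            a, b = rng.sample(ALPHABET, 2)
            key[a], key[b] = key[b], key[a]          # propose: swap two key entries
            proposed = score(decrypt(ciphertext, key), logp)
            if proposed >= current or rng.random() < math.exp(proposed - current):
                current = proposed                   # accept the swap
                if current > best:
                    best, best_key = current, dict(key)
            else:
                key[a], key[b] = key[b], key[a]      # reject: swap back
        return best_key

    corpus = "placeholder reference text goes here".replace(" ", "")
    ciphertext = "wkdwlvdvhfuhw"                     # hypothetical ciphertext
    key = metropolis(ciphertext, bigram_logprobs(corpus))
    print(decrypt(ciphertext, key))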

Randomness Extraction and Privacy Amplification

In cryptography, we want to limit the amount of information an eavesdropper learns. One way to make sure that a certain n-bit string \(X\) is close to unknown to an eavesdropper Eve who holds side information \(Y\) about \(X\) is to extract the randomness out of \(X\) by picking a random function mapping \(n\) bits to fewer bits, so that the output \(f(X)\) is essentially uniformly random from Eve's point of view. Study the details of this privacy-amplification procedure in Sections 8 and 9 of the [CF] script. This is an essential building block for essentially all information-theoretically secure protocols. It often also works against quantum adversaries.
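
The extraction step itself is simple: for instance, a randomly chosen linear map over GF(2) is a 2-universal hash family. The sketch below (the sizes \(n\) and \(m\) are arbitrary) applies such a map to compress an \(n\)-bit string to \(m\) output bits; the theorems in Sections 8 and 9 of [CF] tell you how small \(m\) must be, relative to Eve's uncertainty about \(X\), for the output to look uniform to her:

    # f(x) = A x over GF(2), with A a random m-by-n binary matrix: a standard
    # 2-universal hash family used for privacy amplification.
    import random

    def random_linear_hash(n, m, rng):
        A = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
        def f(x_bits):
            return [sum(a * x for a, x in zip(row, x_bits)) % 2 for row in A]
        return f

    rng = random.Random(7)
    n, m = 16, 4                                   # illustrative sizes only
    x = [rng.randrange(2) for _ in range(n)]       # the raw n-bit string X
    f = random_linear_hash(n, m, rng)              # the publicly chosen hash function
    print("X    =", x)
    print("f(X) =", f(x))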

Uniqueness of the Logarithmic Uncertainty Measure

In his 1948 paper, Shannon gave a small set of axioms that were sufficient to uniquely select entropy as a measure of uncertainty (up to the choice of logarithmic base). Other authors since him have used slightly different but equivalent sets of axioms, and many textbooks in information theory contain a version of this theorem. Some things you can do under this topic are:
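
For reference, one common formulation of the axioms (following Shannon's 1948 paper; different textbooks use slightly different but equivalent versions) is:

\[
\begin{aligned}
&\text{(1) Continuity: } H(p_1,\ldots,p_n) \text{ is continuous in the } p_i.\\
&\text{(2) Monotonicity: } H\bigl(\tfrac{1}{n},\ldots,\tfrac{1}{n}\bigr) \text{ is monotonically increasing in } n.\\
&\text{(3) Grouping: } H(p_1,\ldots,p_n) = H(p_1+p_2,p_3,\ldots,p_n) + (p_1+p_2)\,H\bigl(\tfrac{p_1}{p_1+p_2},\tfrac{p_2}{p_1+p_2}\bigr).
\end{aligned}
\]

The only functions satisfying all three are of the form \(H(p_1,\ldots,p_n) = -K \sum_i p_i \log p_i\) for a constant \(K>0\), i.e. entropy up to the choice of logarithmic base.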

Entropy rates in language

This paper investigates the entropy rate of English text. It was one of the first in a now-blooming paradigm that tries to capture linguistic behaviour as communication in an information-theoretic sense. The authors show that while the average entropy of each sentence in a paragraph rises if the sentence is taken out of context, the entropy given all previous sentences stays more or less constant. It is hypothesized that linguistic communication is organized so as to communicate at a constant entropy rate. Later research has backed up this hypothesis for spoken language. See also this link.