If you're feeling more adventurous, Cover and Thomas's textbook is the information theory bible. It's very dense but absolutely packed with insight.
It starts by talking about African talking drums: how drum language encodes a very simplified version of speech, and how it handles error correction over great distances. From there it goes into a lot of great historical explanations, including Shannon.
Ha! Noticed what you did there. Was that intentional?
I think grad students might be able to relate to the following: if one's master's thesis were as world-changing as Shannon's was, how would one even muster up the courage to join a PhD program after that? What could one possibly do that would have even a token respectability compared to laying the foundation of digital circuitry, the foundation of something that has genuinely changed the world? For grad students who are mere mortals, that would be terrifying.
If you want a little bit of both, I highly recommend Seife's Decoding the Universe: How the New Science of Information Is Explaining Everything in the Cosmos, from Our Brains to Black Holes or Gleick's The Information: A History, A Theory, A Flood.
> But, is that the best we can do? Shannon’s Theorem 3 does not address that question, since it only suggests a suboptimal code. (The optimal code of rate R simply disregards all but the 2^(NR) most probable sequences of length N.) Shannon finds the answer in Theorem 4: as long as we require probability of error strictly less than 1, asymptotically, we cannot encode at rates below the entropy. This statement is commonly known as the strong converse source coding theorem. The converse (or weak converse) source coding theorem asserts that error probability cannot vanish if the compression rate is below the entropy.
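To make the 2^(NR) count concrete: a biased source has far fewer "typical" length-N sequences than the 2^N possible ones, and entropy measures exactly how many. A quick sketch (my own illustration, not code from the paper):

    // Binary entropy of a biased coin, and the rough count of typical
    // length-N sequences, about 2^(N*H) -- an optimal code of rate R ~ H
    // keeps only those.
    #include <cmath>
    #include <cstdio>

    double entropy(double p) {            // binary entropy, in bits
        if (p == 0.0 || p == 1.0) return 0.0;
        return -p * std::log2(p) - (1 - p) * std::log2(1 - p);
    }

    int main() {
        double p = 0.1;                   // heavily biased source
        int N = 100;                      // block length
        double H = entropy(p);
        std::printf("H       = %.3f bits/symbol\n", H);
        std::printf("2^N     = %.3g total sequences\n", std::pow(2.0, N));
        std::printf("2^(N*H) = %.3g typical sequences\n", std::pow(2.0, N * H));
        return 0;
    }

For p = 0.1, H is about 0.47, so only about 2^47 of the 2^100 length-100 sequences are typical; a code at a rate near H keeps just those and still almost never errs.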
Later on the same page:
> The variable-length source code that minimizes average length was obtained by D. Huffman
So what you want is to learn how Huffman coding works. There are many such explanations - it's a staple of data structures courses and textbooks. One example is in the classic "Structure and Interpretation of Computer Programs" at https://mitpress.mit.edu/sicp/full-text/sicp/book/node41.htm... . Look around and you'll find plenty more.
There are also many code examples at https://rosettacode.org/wiki/Huffman_coding , and worked-out examples at StackOverflow, like https://stackoverflow.com/questions/11587044/how-can-i-creat... .
The Wikipedia page at https://en.wikipedia.org/wiki/Huffman_coding seems unhelpful for a novice.
The patents have now expired, which is why the technology is starting to appear in open-source and royalty-free implementations like Zstd.
> The optimality of the Huffman code must have seemed at the time to leave little room for further work in fixed-to-variable source coding. That, however, proved not to be the case, because of two major difficulties: 1) the distribution of the source may not be known when the code is designed (Section II-E), and 2) although the Huffman algorithm need not operate symbol by symbol, its complexity grows very rapidly with the length of the source block. ... The second shortcoming is circumvented by the arithmetic coding method of J. Rissanen (generalized in [...] and popularized in [...]), whose philosophy is related to that of the Shannon–Fano code.
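For intuition, here is a toy version of the interval-narrowing idea behind arithmetic coding (my own sketch of the general technique, not Rissanen's actual algorithm; real coders use integer arithmetic and renormalization rather than doubles):

    // Each symbol narrows the current interval in proportion to its
    // probability, so the whole message maps to one subinterval of [0, 1).
    #include <cstdio>
    #include <string>

    int main() {
        const double pa = 0.8;    // assumed toy model: P('a') = 0.8, P('b') = 0.2
        std::string msg = "aaba";

        double lo = 0.0, hi = 1.0;
        for (char c : msg) {
            double w = hi - lo;
            if (c == 'a') hi = lo + w * pa;   // 'a' takes the first 80%
            else          lo = lo + w * pa;   // 'b' takes the last 20%
            std::printf("%c -> [%f, %f)\n", c, lo, hi);
        }
        return 0;
    }

Any number inside the final interval identifies the whole message, and since the interval's width equals the message's probability, about -log2(width) bits suffice: per-symbol entropy, without building Huffman trees over ever-larger blocks.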
I made two assumptions in my comment: 1) s-c-h knows nothing about compression, so would like to start with the easiest code developed along information-theoretic lines, and 2) the author of the IEEE paper knows enough about the topic that I could quote the text directly.
To clarify, the author is https://en.wikipedia.org/wiki/Sergio_Verd%C3%BA and is "the Eugene Higgins Professor of Electrical Engineering at Princeton University, where he teaches and conducts research on Information Theory in the Information Sciences and Systems Group."
I do not have enough knowledge of the field to tell if he included the correct qualifiers. The Huffman code is a "variable-length source code that minimizes average length" while arithmetic codes appear after mentioning the goal of "minimum average length in terms of the distribution of the source."
That appears correct to me.
It is also used for deep-space communications, where squeezing out every bit is essential.
The Wikipedia article is quite well crafted.
A sample C++ implementation can be found here.
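In case that link rots, here is a minimal Huffman coder sketch (my own, not the linked implementation): build the tree with a min-heap keyed on frequency, then walk it to assign codewords.

    // Build a Huffman tree from symbol frequencies and print the codes.
    #include <cstdio>
    #include <map>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        long freq;
        char sym;                         // meaningful only for leaves
        Node *left = nullptr, *right = nullptr;
    };

    struct Cmp {                          // min-heap on frequency
        bool operator()(const Node* a, const Node* b) const {
            return a->freq > b->freq;
        }
    };

    void assign(const Node* n, const std::string& prefix,
                std::map<char, std::string>& codes) {
        if (!n->left && !n->right) { codes[n->sym] = prefix; return; }
        assign(n->left,  prefix + "0", codes);
        assign(n->right, prefix + "1", codes);
    }

    int main() {
        std::string text = "abracadabra";
        std::map<char, long> freq;
        for (char c : text) freq[c]++;

        std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
        for (auto& [c, f] : freq) pq.push(new Node{f, c});

        while (pq.size() > 1) {           // repeatedly merge the two rarest subtrees
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{a->freq + b->freq, 0, a, b});
        }

        std::map<char, std::string> codes;
        assign(pq.top(), "", codes);      // frequent symbols end up with short codes
        for (auto& [c, code] : codes)
            std::printf("'%c' (%ld): %s\n", c, freq[c], code.c_str());
        return 0;
    }

(It leaks the nodes, which is fine for a demo, and needs C++17 for the structured bindings.)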
Khan Academy has done an information theory series, too.
I do remember the algorithms (Huffman coding), but the high-level concepts are harder to remember: information being directly tied to randomness (entropy). Information is the unknown (random?) part of the message, which sounds more like noise. Channel capacity makes sense: given the noise, there's an upper limit on how fast you can reliably communicate bits and groups of bits.
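That capacity intuition has an exact form for the common case of a bandlimited channel with Gaussian noise, the Shannon–Hartley theorem:

    C = B * log2(1 + S/N)

where C is the capacity in bits per second, B the bandwidth in hertz, and S/N the signal-to-noise power ratio. More noise lowers the ceiling, but never pushes it to zero.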
You read the newspaper each day. Every day it's got the same news: "Today the sun shines." That carries no useful information (zero entropy), and you can predict it perfectly.
Second scenario: you read the newspaper, but each day it says either "Today the sun shines" or "Today it rains," with 50/50 probability. That's maximum entropy, and it's useful information that you can't reliably predict.
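To put numbers on the two scenarios: for a daily forecast with P(sun) = p, the entropy is

    H(p) = -p*log2(p) - (1-p)*log2(1-p)

    H(1)   = 0 bits/day   (the always-sunny paper)
    H(0.5) = 1 bit/day    (the 50/50 paper: the maximum for a yes/no message)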