Here is an excellent book on this subject "From Classical to Quantum Shannon Theory" https://arxiv.org/abs/1106.1445
Each little ball is a particular signal as received, with its errors from transmission, and the big ball comes from the power of the signal.
So the number of little balls is how many different signals you can send, at the assumed noise level, and still have the signals all distinct at the receiver. So, sure, if you send with more power you have a larger big ball and more signals. Also, if you have lower noise, the signals are corrupted less, the little balls are smaller, and you can send more signals.
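If it helps to make that concrete, here is a back-of-the-envelope sketch of the sphere-packing count (the power and noise numbers are made up for the example; the 1 + P/N ratio is the standard heuristic behind the capacity formula):

    import math

    def sphere_packing_estimate(P, N, n):
        # Big ball: received vectors of block length n live within radius sqrt(n*(P+N)).
        # Little balls: noise spheres of radius sqrt(n*N) around each sent codeword.
        # Non-overlapping little balls that fit = volume ratio = (radius ratio)^n.
        M = ((P + N) / N) ** (n / 2)
        bits_per_channel_use = 0.5 * math.log2(1 + P / N)
        return M, bits_per_channel_use

    # e.g. signal power 1, noise power 0.1, block length 100
    M, rate = sphere_packing_estimate(1.0, 0.1, 100)
    print(f"{M:.3e} distinguishable signals, {rate:.2f} bits per channel use")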
Sorry, that's my best summary from the time I did look at information theory some years ago!!!
Mathematics has a habit, peculiar to those outside the domain, of calling things like this "introductions" - it's technically true, but unhelpful to those outside the target audience. Unless you dilute the topic to the point of uselessness, this is basically what an introduction to the subject looks like.
Still, it's a little hard for me to figure out what the target audience for this is. Grad students would probably want a focused textbook, so I'm guessing this is for undergrads who are only partially through the full prereqs, but who have a lot of interest in it.
However, it made me hunt around for more appropriate sources, and I found a few, so that at least is good.
They should first define what they mean by probability. For example, what am I supposed to fill in for p when 'b' always occurs after exactly three 'a's?
(p is independent)
If for no reason other than space, even relatively accessible papers and books need to limit how fully self-contained they are. Most advanced topics have prerequisites - in this case, you should really know probability (and therefore calculus/analysis) well before tackling information theory. Linear algebra will also be required if you want to do anything practical with information theory, like error-correcting codes.
Striking a balance between full exposition and reasonable length can be tricky. You could make this paper fully accessible, but it would be significantly longer for questionable benefit. Most authors are not, in fact, good writers of pedagogical material, especially when they write beyond their specialization. It generally turns out to be better to go through several more compact, focused resources than one monolithic one.
The string is a sequence of independent and identically distributed random variables (a Bernoulli process). 'a' and 'b' are mutually exclusive events, not independent ones.
I also provided the answer.
But this does not hold for most real-life data streams.
Learning advice: you shouldn't read line by line, stopping to think after every sentence. Skim each chapter first. What is the main point? You may also want to look at the next chapters to see where they are going before finishing the current one.
It is actually a very reasonable assumption for almost all kinds of data, given that suitable compression is applied. Well-compressed data is essentially uniformly random.
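To make that a bit more concrete, here's a rough sketch (zlib is just standing in for "suitable compression", and the toy text is an arbitrary redundant source I made up): estimate the byte entropy before and after compressing.

    import math, random, zlib
    from collections import Counter

    def bits_per_byte(data: bytes) -> float:
        # Empirical entropy of the byte histogram; 8.0 would mean "looks uniform".
        counts = Counter(data)
        n = len(data)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # Redundant "text": words drawn i.i.d. from a tiny vocabulary.
    random.seed(0)
    words = ["information", "theory", "entropy", "channel", "noise", "signal"]
    text = " ".join(random.choice(words) for _ in range(20000)).encode()

    print(bits_per_byte(text))                    # low: the text is very redundant
    print(bits_per_byte(zlib.compress(text, 9)))  # typically much closer to 8 bits/byte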
When p('a') = p and p('b') = 1 - p, it means that the probabilities of 'a' and 'b' do not depend on anything. The probabilities must sum to 1, since together they make up the certain event: p('a' or 'b') = p('a') + p('b') = p + (1 - p) = 1. That is, we assume that only the symbols 'a' and 'b' are possible.
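A tiny sketch of what "do not depend on anything" looks like in practice (the p = 0.7 is an arbitrary value I picked):

    import random

    def bernoulli_string(p: float, n: int, seed: int = 0) -> str:
        # Each symbol is drawn on its own: 'a' with probability p, 'b' with 1 - p.
        rng = random.Random(seed)
        return "".join("a" if rng.random() < p else "b" for _ in range(n))

    s = bernoulli_string(0.7, 100_000)
    print(s.count("a") / len(s))   # ~0.7, and no draw ever looks at the history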
If there were a relationship between 'a' and 'b', say that 'b' always occurs after every 'aaa', then when we receive 'aaa' we know that the next symbol is 'b' -- always.
So in this case the probability of 'b' has a relationship, a condition on the history, which would be written as p('b' | hist = 'aaa') = 1. A much more useful framework for this is a Markov process with a history/memory of 3. A graph for such a process can be seen here: https://mermaidjs.github.io/mermaid-live-editor/#/view/eyJjb...
Each node is a state and the edges represent the possible outputs. The rule for such a graph is that the probabilities of the "leaving" edges must sum to 1 -- we must of course always leave the current state we are in. Notice that it takes a sequence of 'aaa' to enter node "D", after which we _must_ output a 'b'. Using some matrix formulations it is possible to calculate the probabilities of 'a' and 'b' (the stationary distribution, I think it is called).
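Here is a small sketch of that matrix calculation (numpy; q is a stand-in for the 'a'-probability on the A/B/C edges, since I'm not reading the exact numbers off the graph):

    import numpy as np

    # States count the consecutive 'a's just emitted: A=0, B=1, C=2, D=3.
    # In A, B, C we emit 'a' with probability q (placeholder value); in D we
    # must emit 'b', which resets the run.
    q = 0.5
    P = np.array([
        [1 - q, q, 0, 0],   # A: 'b' -> A, 'a' -> B
        [1 - q, 0, q, 0],   # B: 'b' -> A, 'a' -> C
        [1 - q, 0, 0, q],   # C: 'b' -> A, 'a' -> D
        [1.0,   0, 0, 0],   # D: forced 'b' -> A
    ])

    # Stationary distribution: keep applying the transition matrix until pi settles.
    pi = np.full(4, 0.25)
    for _ in range(10_000):
        pi = pi @ P

    p_a = q * pi[:3].sum()   # P('a') = sum over states of pi_state * P('a' | state)
    print("stationary distribution:", pi)
    print("P('a') =", p_a, "P('b') =", 1 - p_a)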
And to return to the first point: why is independence a reasonable assumption? In the Markov process, in node D, we know that we must always output a 'b'. In terms of information theory, if we receive 'aaa' then the 'b' is given and provides no new information. Therefore we can perfectly predict it, and we could also remove it (compress the data) without _losing_ information.
A string like 'aaab' contains the same information as 'aaa', since we _know_ that there must be a 'b' after the three 'a's.
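And a tiny sketch of that "remove it without losing information" step, using the same toy rule:

    def compress(s: str) -> str:
        # Drop every 'b' that is forced by the rule "'b' always follows 'aaa'".
        out, run = [], 0
        for c in s:
            if c == "a":
                run += 1
                out.append(c)
            else:
                if run != 3:        # only the unpredictable 'b's carry information
                    out.append(c)
                run = 0
        return "".join(out)

    def decompress(t: str) -> str:
        # Reinsert the forced 'b' after every run of three 'a's.
        out, run = [], 0
        for c in t:
            out.append(c)
            if c == "a":
                run += 1
                if run == 3:
                    out.append("b")
                    run = 0
            else:
                run = 0
        return "".join(out)

    s = "aabaaabab" + "aaab" * 5    # any string obeying the "'b' after 'aaa'" rule
    assert decompress(compress(s)) == s
    print(len(s), "symbols ->", len(compress(s)), "symbols, nothing lost")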
I hope that explains why independence is reasonable.
> It is actually a very reasonable assumption for almost all kinds of data, given that suitable compression is applied. Well-compressed data is essentially uniformly random.
What kinds of data are an exception? Your explanation seems to cover everything
Second, you might (in practice, will) not be able to completely compress the data. A picture might be worth a thousand words, but it still takes up a megabyte or so on disk. That makes for about 1000 bytes per word ;) So the entropy/information of a picture might be very small ("A dog jumping into water"), but we have no chance of truly understanding a general source (reality) and expressing its full machinery.