
Shortest bit sequence that's never been sent over the Internet (2017) - wglb
https://www.seancassidy.me/whats-the-length-of-shortest-bit-sequence-thats-never-been-sent-over-the-internet.html
======
panic
_> We're looking for a specific sequence, though, not a specific number of
heads in a row. We don't even know what the sequence is since it hasn't been
sent yet. Is that a problem? Not at all! We're looking for some sequence of
length n, and given that both 0 and 1 are equally likely, the sequence 00110
is equally likely as 11111._

Interestingly enough, this isn't true!

First, let's test on a small example: how likely are the substrings "11" and
"10" to appear in binary strings of length 3? Here's a table with the matches
marked.

    
    
            "11" "10"
        000
        001
        010        *
        011   *
        100        *
        101        *
        110   *    *
        111   *
    

"10" can appear in four ways, but "11" can only appear in three. Why is this?

Say you're scanning through a bit string, looking for "11111". You've seen
"111" so far -- that's three matches. Now you encounter a "0". Your counter
resets to zero matches until you see a "1" again.

Now say you're looking for "00110". You've seen "001" so far. Just like
before, you encounter a "0". You still need to reset your counter, but this
time the "0" may actually be the start of a _new_ "00110". So your counter
resets to one, not zero. This means "00110" matches are easier to find, and
happen more frequently!
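
A quick brute-force Python sketch that reproduces the table's counts and
extends them to longer strings (it just counts how many strings of a given
length contain each pattern):

    from itertools import product

    def strings_containing(pattern, length):
        # count bit strings of `length` bits that contain `pattern`
        return sum(pattern in ''.join(bits)
                   for bits in product('01', repeat=length))

    # reproduces the table above: 3 strings contain "11", 4 contain "10"
    print(strings_containing('11', 3), strings_containing('10', 3))
    # the same gap shows up for longer patterns, e.g. in 10-bit strings:
    print(strings_containing('11111', 10), strings_containing('00110', 10))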

~~~
bhouston
> This means "00110" matches are easier to find, and happen more frequently!

They do not happen more frequently in a random bit stream. You are absolutely
and completely wrong about this.

They only happen more frequently if you switch the problem from counting
occurrences to counting bitstreams that contain these occurrences.

The reason is that the sequence 111 contains two substrings of 11. Thus, if
this happens in a bitstream and you are counting bitstreams, you only get a
count of 1, whereas if you are counting occurrences you would still get two.

This will occur any time a sequence can be overlapped with itself.
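
To make the distinction concrete, a quick Python sketch counting both ways
over all 3-bit strings (total occurrences come out equal; strings-containing
do not):

    from itertools import product

    def count_occurrences(pattern, s):
        # overlapping occurrences of `pattern` in `s`
        return sum(s[i:i + len(pattern)] == pattern
                   for i in range(len(s) - len(pattern) + 1))

    strings = [''.join(bits) for bits in product('01', repeat=3)]
    for pattern in ('11', '10'):
        total = sum(count_occurrences(pattern, s) for s in strings)
        containing = sum(pattern in s for s in strings)
        print(pattern, total, containing)   # "11": 4 and 3; "10": 4 and 4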

~~~
panic
You're right, I didn't word that very clearly. I meant "more strings contain
this substring", not "this substring occurs more times total" (which, as you
say, would be the usual meaning of "matches happen more frequently").

Why care about strings containing a particular substring versus total number
of occurrences? The article referenced this PDF:
[https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf](https://www.cs.cornell.edu/~ginsparg/physics/INFO295/mh.pdf),
which is about how many coin flips it takes to see a run of N heads. That's
really what the argument in the second half of my post is about, but it
applies to the "how many strings contain this substring" problem too, and that
one seemed simpler to draw a table for.

------
schoen
As the article notes, encryption does make a huge difference here. For global
web use we've reached something like 70% of sessions

[https://letsencrypt.org/stats/#percent-
pageloads](https://letsencrypt.org/stats/#percent-pageloads)

However, that doesn't cover 70% of bytes. For example, software updates are
often downloaded over HTTP (hopefully with digital signatures!). Debian and
Debian-derived distributions distribute most of their package updates over
HTTP, authenticated with PGP signatures.

Most of those packages are nonetheless _compressed_ , which increases the
variety in bit sequences, but then most of the downloads are of _identical
compressed files_ , which decreases the variety.

On the other hand, video streaming is often encrypted now, but still sometimes
not encrypted. But even when it's not encrypted for confidentiality or
authenticity, it's often encrypted in order to impose DRM restrictions. In any
case, it's usually also heavily compressed, which again increases the variety
of bit sequences transmitted even for unencrypted video.

To give a rough guess, I think the _combination_ of encryption and compression
means that the author is roughly correct with regard to information being
transmitted today. Even when different people watch the same video on YouTube,
YouTube is separately encrypting each stream, and nonrandom packet headers
outside of the TLS sessions (and other mostly nonrandom associated traffic
like DNS queries and replies) represent a very small fraction of this traffic.

It might be interesting to try to repeat the calculation if we made different
assumptions so that compression was in wide use (hence the underlying
_messages_ are nearly random) but encryption wasn't (hence very large numbers
of messages are byte-for-byte identical to each other). Can anybody give a
rough approach to estimating the impact of that assumption?

~~~
cortesoft
Even if every single bit of traffic was encrypted, a huge portion of that
would still be identical - TCP headers, for one thing, would share a lot of
common bits for each packet. Although bandwidth reporting often ignores those,
so I am not sure about the data source for world bandwidth.

~~~
wskinner
Wouldn’t some combination of unique session keys, PFS algorithms, or block
encryption modes make this false? When would the headers encrypt to the same
ciphertext unless you were using ECB mode with the same key?

~~~
schoen
I think you misinterpreted cortesoft's point, which I could try to make
clearer:

"Even if every single bit of _payload_ traffic was encrypted, a huge portion
of _the traffic actually sent over the wire_ would still be identical - TCP
headers, for one thing, would share a lot of common bits for each packet."

So I think you're in agreement here.

~~~
votepaunchy
The original question was “sent over the Internet” not “sent over the wire”.

~~~
dooglius
Anything in the IP header (perhaps excluding the TTL and checksum) and below
is sent over the Internet.

------
x1798DE
I like this general approach, but presumably you don't want the point where
50% of the sequences have never been transmitted (on average, assuming
randomly distributed data); you want the first point where at least one hasn't
been transmitted. So you should be solving for the point where the probability
of a sequence of that length not appearing, times the number of sequences of
that length, equals 1.

That said, it's a logarithmic scale, so the length is probably not that much
shorter than the one he came up with.
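
A rough numeric sketch of that in Python (treating the ~B windows as
independent uniform samples, which is only an approximation):

    import math

    B = 3.4067e22  # total bits sent, per the article's estimate

    def expected_missing(n):
        # expected number of n-bit strings never seen among ~B samples,
        # approximating (1 - 2**-n)**B by exp(-B / 2**n)
        return 2**n * math.exp(-B / 2**n)

    # first length where we expect at least one string to be missing
    n = 1
    while expected_missing(n) < 1:
        n += 1
    print(n)  # about 70 under these assumptions, a few bits below 74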

~~~
mirimir
How can you ever do better than a probabilistic estimate for the shortest
sequence length where one has never been transmitted?

------
tzs
Note: I'm going to use bit strings rather than bit sequences.

Given a total of B = 3.4067 x 10^22 bits sent over the internet, I don't think
there is any reasonable way to say for sure what the length of the smallest
string that has never been sent is, but we can say that there is definitely a
75 bit string that has never been sent.

Take all of the internet transmissions, and concatenate them, giving a string
S of B bits.

Every string that has been transmitted is a substring of S.

There are at most B substrings of length n in S, and hence at most B distinct
strings of length n that have been transmitted.

If we pick n such that 2^n > B, there must be at least one string of length n
that has not been transmitted. 2^n > B whenever n >= 75.

Hence, if the value of B is correct, then there must be a 75 bit string that
has never been sent.
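
A quick check of the arithmetic in Python, using the B above (the False
confirms that 74 bits is not enough for this pigeonhole argument):

    import math

    B = 3.4067e22                       # bits sent, the figure above
    n = math.ceil(math.log2(B))
    print(n, 2**n > B, 2**(n - 1) > B)  # 75 True False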

~~~
evanb
If you concatenate to get S you may introduce substrings that have never been
themselves transmitted, though of course I agree that every sequence that has
been transmitted is a substring of S.

~~~
cup-of-tea
Yeah, which means 74 is an upper bound on the lengths for which every possible
bit sequence might already have been transmitted.

------
sgentle
Relatedly, there must be a (much longer) shortest sequence of bits that it is
_impossible_ to send over the internet, because beyond a certain length the
payload must be split over multiple packets, requiring additional header bits
in between.

In theory, the largest IPv4 packet that can be sent is 65,535 bytes. IPv6 is
the same, but also allows for "jumbograms" that can be up to 4GB long, yeesh.
However, the practical limit is the maximum frame size of the underlying link.
(Standard) Ethernet and PPP both top out at 1500, ATM can handle 4456. Non-
standard Ethernet jumbo frames can go up as high as 9216 bytes.

But assuming you consider "send over the internet" to imply that most
destinations on the open internet would be able to receive it, that means a
frame size of 1500 bytes, giving you 12000 bits to play with. This leads to
the curious result that it's impossible to send a sequence of more than 11870
"1"s.

The shortest header you could have is a 160-bit IPv4 header, of which the last
32 bits (the destination address) can't all be 1 because 255.255.255.255 isn't
routable, and actually 127.255.255.255 isn't either, so the earliest place you
can start is bit 130. After 11870 "1"s, you reach the 4-bit version field of
the next packet, which can't start with a 1 because it would indicate IPv8 or
higher, and I don't think that exists... or at least it hasn't seen widespread
adoption.
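
The arithmetic behind that figure, as a tiny sketch (the 130-bit offset is the
destination-address argument above, so it inherits those assumptions):

    frame_bits = 1500 * 8        # standard Ethernet MTU, in bits
    first_usable_bit = 130       # per the IPv4 header / destination argument
    print(frame_bits - first_usable_bit)  # 11870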

------
orangea
The question that this person posed is

> At what value for X is there a 50% chance there [exists] a sequence of
> X-bits in length that hasn't been transmitted yet?

But if I understand correctly, the question that they answered is "At what
value for X is there a 50% chance that, given some specific sequence of X
bits, that sequence hasn't been transmitted yet?" Wouldn't these two questions
have different answers?

Edit (now that I've thought about it more): On top of that, the expected value
of a random variable isn't necessarily a value that it has a 50% chance to be.
So even my interpretation of this result is wrong.

~~~
joshuamorton
>Wouldn't these two questions have different answers?

Yes, though in practice they're close.

To answer the actual question asked: we can say that there is a length n such
that not all sequences of length n can be contained in our total bits
transferred, B. This is a consequence of the pigeonhole principle.

Naively, this would be B = n * S, and S = 2^n. So B = n * 2^n. Note that we
can actually compress things more though, if we assume that in the worst case
every window in the internet bit-corpus is unique.

That this is possible isn't immediately obvious, but consider

01, 00110, 0001011100, 0000100110101111000,
000001000110010100111010110111110000 (which, interestingly, isn't in OEIS, but
related sequences: [https://oeis.org/A052944](https://oeis.org/A052944) and
[https://oeis.org/A242323](https://oeis.org/A242323) are).

These bitstrings contain, perfectly overlapping, every bitstring of length n,
for n from 1-5. In general, it takes 2^n + n - 1 bits to convey every possible
bitstring, if you manage to overlap them perfectly (if someone can prove this,
please do. I thought it was Gray code/Hamiltonian path over hypercube related,
but I unconvinced myself of that).

EDIT: Someone else mentioned De Bruijn sequences, which these are. And they
are based on Hamiltonian paths, although not over hypercubes :(. And my
sequence is on OEIS, as [https://oeis.org/A166315](https://oeis.org/A166315),
just converted to decimal. /EDIT
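
For what it's worth, a quick sliding-window check of the example strings above
(Python; the strings are copied straight from the list):

    # verify: each example packs all 2**n n-bit strings into 2**n + n - 1 bits
    examples = {
        1: "01",
        2: "00110",
        3: "0001011100",
        4: "0000100110101111000",
        5: "000001000110010100111010110111110000",
    }

    for n, s in examples.items():
        windows = {s[i:i + n] for i in range(len(s) - n + 1)}
        assert len(s) == 2**n + n - 1
        assert len(windows) == 2**n   # every n-bit string appears exactly once
        print(n, "ok")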

So the real answer is just B = 2^n + n - 1, but for n > 15 or so, n - 1 is so
small we can ignore it. In other words, our length n is just log(B). Given the
assumption in the article of 3.4067e22 bits, the base 2 log of that
is...74.85. This is exactly 1 more than what the article says is the point
where you'll have seen half the messages. This isn't a coincidence.

Which raises an interesting question that I leave as an exercise to the
reader:

Why is the length for which you _cannot_ have conveyed all bit strings
_exactly_ 1 bit more than the point where you've conveyed about half of them?

~~~
freyir
You’re giving the range for a 50% to 100% chance of non-transmission.

But in the spirit of the original question, I'd suppose we want to find _n_
such that the probability of non-transmission of a given sequence is 2^-_n_,
so that the expected number of untransmitted sequences is 1? Is that number
similarly close?

------
larl
Back of the envelope to double check:

Some estimates put the internet traffic at 2^70 bits per year and growing. For
any short sequence of N bits, there are approximately 2^70 such sequences
transmitted per year. Let's assume that, since they are encrypted, they are
uniformly distributed. So what is the sequence length for which our p(seen
all) = 50%? Seems like 71 bits is about right.

The traffic growth curve is probably ~2x per year, so each year adds about 1
more bit. The sum of all internet traffic for all time, at present, is
therefore worth about 1 more bit.

So ~73 bits.

Math checks out.

------
dosycorp
The way I think of this is:

- nearly everything is compressed

- many things are encrypted

Therefore,

For any given "random" ( "high entropy" ) string of length X, there's some
non-negligible chance it's already been sent.

But it's far less likely that a ( partially degraded ) non random string is
sent. Why ?

Consider this:

"the cat sat on the hat" ( probably sent )

"the cut sat on the hat" ( still probably sent )

"thx cut set19n the mkt" ( waaaay less likely to be sent )

"thKxc8t suts n x4e m-t" ( probably never sent ... until now :) )

My reasoning is like, all random strings are ( happy / random ) in the same
way. They all look alike. High entropy, but low organization / structure.
Because of compression and encryption, any random string probably has as good
a chance to be sent as any other, so looking at random strings doesn't really
get us anywhere ( but I do make a very very rough calculation at the end that
says probably all 7 byte strings have been sent ).

It's going to be far easier in my opinion to find an "almost-language" string
( partially degraded, like the above examples ) that's never been sent.

Remember Googlewhacks? 1 search result. One tactic was putting together
uncommon words. Another was misspellings.

Basically the intuition / intuitive idea I'm trying to convey is : pick any
random high entropy string of given short length, and pick any language string
of given short length, and they are both, in my opinion, more likely to have
been sent than an "almost language" string of same length. The more degraded
you make it ( up to a point, heh ) the less likely it was ever sent.

 _Very rough calculation about random strings_

So, assuming the question is for what X is p > 0.5, and assuming that 1
zettabyte has been sent through the net through its entire history, so
10^18*8 bits, or roughly 2^(63.8), roughly every 58-bit string has been
sent.

So roughly every 7 byte string ever possible has been sent on the internet.
Probably.

------
gtrubetskoy
It'd be interesting to know the length of a sequence that will _never_ be sent
over the Internet. In B. Schneier's "Applied Cryptography" there is a
discussion of 256-bit numbers, the minimal amount of energy and mass required
to guess them, and the conclusion that "brute-force attacks against 256-bit keys
will be infeasible until computers are built from something other than matter
and occupy something other than space."

The length of a never-transmitted-on-the-internet sequence is probably much
shorter than 256 bits, even 128 bits (as the article mentions).

More details here:

[https://security.stackexchange.com/a/25392](https://security.stackexchange.com/a/25392)

------
candiodari
>>> len(''.join(map(lambda x: bin(ord(x))[2:], "I'm wrong")))

61

~~~
schoen
Hee hee!

But your joke is probably more precise if you write it as

8 * len("I'm wrong")

since the leading 0 bits still get sent on printable ASCII characters.

~~~
james_a_craig
Come on, you need to encode it if you're counting bytes not characters, even
if that's equivalent for this particular string. :)

8*len("I'm wrong".encode('utf-8'))

------
FRex
It's a bit ironic in a special paradoxical math like way (like [0]) that if
someone discovers it and posts it online it'll instantly (ca. HTTPS vs. HTTP,
other encryption, whether or not someone sees the page, etc.) be wrong by
definition. It's like figuring what do you call an UFO that has been
identified.

[0] -
[https://en.wikipedia.org/wiki/Russell%27s_paradox](https://en.wikipedia.org/wiki/Russell%27s_paradox)

~~~
coroxout
I am suddenly nostalgic for the Googlewhack fad:
[https://en.wikipedia.org/wiki/Googlewhack](https://en.wikipedia.org/wiki/Googlewhack)

"A Googlewhack is a contest for finding a Google search query consisting of
exactly two words without quotation marks that returns exactly one hit. A
Googlewhack must consist of two actual words found in a dictionary.

Published googlewhacks are short-lived, since when published to a web site,
the new number of hits will become at least two, one to the original hit
found, and one to the publishing site."

~~~
mywittyname
Does anyone remember a comedian that did a comedy special many, many years ago
centered around Googlewhacks? I remember he got a tattoo of a Texas ID on his
arm.

------
newman8r
Interesting to think about. Would probably make a good interview question just
to see how people try to solve it.

------
PaulRobinson
If the method stands up and is considered fine, that means the shortest string
is increasing at a rate of ~1 bit every 3 years.

That implies that if the Internet does not grow any more, we should be
confident of a > 50% chance of a UUID not being universally unique in about
150 years.

Is this the next Y2K/2038 problem? :)

~~~
C4stor
If the traffic doesn't go up significantly, the shortest string increase will
drop exponentially. Even with only the last 5 years, you can observe it's
slowing down, and that's with a hefty amount of traffic increase to back it
up. 128 bits is super safe for a long long time!

------
replicatfied

      This is how my intuition went: it's probably less than 128 
      bits because UUIDs are 128 bits, and they're universally 
      unique.
    

But what's in a name? There's no natural law constraining the UUID standard,
such that they must be actually universally unique. And 128 bits isn't such an
incredible bit space.

MD5 hashes are 128 bits, and prone to manipulation in favor of collisions.

Don't get me wrong, 3.402823669209385*10^38 is a huge number, and we haven't
used enough passwords to occupy every value in that key space, but I still
don't imagine 128 bits provides truly universally unique coverage, just
pretty okay uniqueness coverage.

~~~
X-Istence
I think you should take a gander at this:
[http://nowiknow.com/shuffled/](http://nowiknow.com/shuffled/)

~~~
sjcsjc
Or maybe this
[https://czep.net/weblog/52cards.html](https://czep.net/weblog/52cards.html)

------
bmm6o
"This matched my intuition nicely!"

His intuition was that it would be between 48 and 128, and he's patting
himself on the back that his calculation resulted in a number between those?
Those goalposts are super far apart!

~~~
mywittyname
That's pretty close for what was basically an off-the-cuff estimation. He
chose those two numbers because their use in the real world gives him a
reasonable point of reference.

The calculated value of 74 bits is pretty close to the mid-point of his
estimation.

------
croddin
How about this question: If all matter in the universe was dedicated to
producing new bit sequences until the heat death of the universe, what is the
length of the shortest sequence that would never be generated?

------
threepipeproblm
Reminds me of the
[https://en.wikipedia.org/wiki/Berry_paradox](https://en.wikipedia.org/wiki/Berry_paradox)

~~~
grkvlt
Yeah, I sort of assumed that the Berry paradox was going to be the point too:
anyone who found the shortest bit sequence could never display it on a web
page, since that would cause its transmission over the Internet, making it
ineligible...

~~~
Dylan16807
Displaying a series of bits on a web page doesn't cause that series of bits to
be transmitted.

~~~
grkvlt
Yeah, I realise that, but what if there was a link to a binary with those
bits, etc.

------
dzdt
I'm a little surprised it is as high as it is. I am used to thinking "64 bits
is enough for anyone."

~~~
samfriedman
Enough for anyone, but not necessarily enough for everyone.

------
shujito
hmm... I was expecting the answer to be zero.

