
It's not random letters. It's random numbers encoded in base52. Because the number strings are encoded, they're probably not random at all.



You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them.


    hobbes@namagiri:~/scratch$ echo gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarmPJLOZYMEILGNIPwNOgEazuBVJcyVjBRL|ent
    Entropy = 5.352821 bits per byte.
    
    Optimum compression would reduce the size
    of this 108 byte file by 33 percent.
    
    Chi square distribution for 108 samples is 650.52, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 92.6019 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.142915 (totally uncorrelated = 0.0).
    
...Which is meaningless without context. Let's see if we can create some context...

    hobbes@namagiri:~/scratch$ echo "You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them." |base64 |sed -e :a -e '$!N; s/\n//; ta'
    WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K
ok, that's base64. Let's turn that into base52...

    CL-USER> (to-base 52 (from-base 64 "WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K"))
    "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO"
Astute readers will note that this output contains numeric digits. There's more than one representation of "base-52". The one above goes from 0 through p. An alternative would go from A-z. But none of that matters for measuring the entropy of base-52-encoded English text, ya dig?
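(For the curious: the to-base / from-base helpers aren't shown above. Here's a rough Python equivalent. The alphabets are my assumptions: standard MIME base64, and "base52" as 0-9 a-z A-P to match the 0-through-p representation described here.)

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
B52 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"

def from_base(alphabet, s):
    # Read s as a big-endian integer written in the given alphabet.
    n = 0
    for ch in s:
        n = n * len(alphabet) + alphabet.index(ch)
    return n

def to_base(alphabet, n):
    # Write integer n as a string in the given alphabet.
    if n == 0:
        return alphabet[0]
    digits = []
    while n:
        n, r = divmod(n, len(alphabet))
        digits.append(alphabet[r])
    return "".join(reversed(digits))

print(to_base(B52, from_base(B64, "aGVsbG8")))  # re-encode a base64 string as base52
```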

    hobbes@namagiri:~/scratch$ echo "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO" | ent
    Entropy = 5.610518 bits per byte.
    
    Optimum compression would reduce the size
    of this 452 byte file by 29 percent.
    
    Chi square distribution for 452 samples is 2093.27, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 86.3496 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.018253 (totally uncorrelated = 0.0).
Conclusion: I don't think the original text was pseudorandom. If you can convince me that the original isn't "base-52 encoded" at all (and I'd be easily convinced), I'm willing to reevaluate that conclusion. I'm also interested to see if anyone sees a flaw in my process.


Well, we know that it was a test update that was released unintentionally. http://www.zdnet.com/article/microsoft-accidentally-issued-a...

So the question is why a test update would have "hidden" meaning underneath the random-looking strings.

As for the flaw in your methodology: your ent command (not familiar with this, so just basing this off what I see) is assuming full use of the binary space (hence assuming 127.5 as the mean of random data). No base52 data will use the full 256-value space, by definition. Base52 (azAZ) will actually have a mean of 93.5 for random data, extremely close to the measured mean. The serial correlation coefficient should also be higher for azAZ than 09azAP, because more of the alphabet is contiguous.
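(The 93.5 figure checks out, for what it's worth; it's just the average ASCII code over a-z and A-Z. Quick sanity check in Python:)

```python
import string

# Expected byte value of a uniformly random character drawn from a-zA-Z.
codes = [ord(c) for c in string.ascii_lowercase + string.ascii_uppercase]
mean = sum(codes) / len(codes)
print(mean)  # 93.5: average of a-z (codes 97..122) and A-Z (codes 65..90)
```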

I did a little bit of analysis on the data as well to determine if the data was random or gibberish typed by a human on a keyboard, and found that most of the data lined up well for true (or pseudo) random. (Ugly code here: http://pastebin.com/9YN93xhi)

  Home row % expected: 34.6% 
  Home row % actual: 38.3%
  Expected upper: 50%
  Actual upper: 47.7%
  Expected sequential case match: 50%
  Actual sequential case match: 55.7%
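(If I'm reading the expectations right, the 34.6% comes from QWERTY's nine home-row letters out of 26, case-insensitive. Quick sanity check:)

```python
# QWERTY's home row has 9 letter keys (a s d f g h j k l) out of 26
# letters; case doesn't matter, so this is the expected home-row
# fraction for uniformly random azAZ characters.
home_row = set("asdfghjkl")
expected = len(home_row) / 26
print(round(expected * 100, 1))  # 34.6
```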


Sure, I see the zdnet article. We could still play this game with any arbitrary string, though!

Regarding the assumption of full use of the binary space... May I phrase that differently? azAZ and 09azAP both leave big swaths of the space as always-zero or perhaps always-one. Another way to phrase it: you could write out an azAZ string using five-bit characters and have room to spare.

The swaths of emptiness in the range of possible values is why I took an English sample and did my best to encode it the same way as the original was, so that we'd have an apples-to-apples comparison. The English sample had higher entropy than the original--after "accounting for" encoding.

You are right, my method of "accounting for" encoding did nothing for the "serial correlation coefficient" metric. I didn't know what that was until you mentioned it, thanks.

Your actual/expected analysis is a good idea, though it took me a minute to understand what you were doing. I guess: "If these are truly random digits from azAZ, then half of them will be uppercase." Indeed. But... I just read an article[0] which convinced me that I have no idea how "random" works. (Specifically this bit: "As an example, the probability of having exactly 100 cars is 0.056 — this perfectly balanced situation will only happen 1 in 18 times.")

Cheers.

0: https://www.quantamagazine.org/20150925-solution-the-road-le...


Your understanding of base52 leaving large amounts of empty space is correct (you need 6 bits to fit it, though). That space throws off the "ent" measure. The 2.3 unused bits in every char dwarf the entropy difference between random data and encoded text. This would probably be easier to see if you'd added in known pseudorandom strings for more context. (I see kaoD provided a Base64 example.)

Your understanding of "random" probably isn't that far off. After all, "the average number of cars on each road will be 100". When you look at random numbers they do the expected thing in aggregate. Individual samplings will vary, though. You could get all 200 cars on one road randomly. It's just exceedingly unlikely.

Cheers. :)


8 bits, 256 symbols.

7 bits, 128 symbols.

6 bits, -- oh. ... Time for coffee.


    $ dd if=/dev/urandom bs=512 count=1 2>/dev/null | base64 -w 0 | ent
    Entropy = 5.933850 bits per byte.
    
    Optimum compression would reduce the size
    of this 684 byte file by 25 percent.

    Chi square distribution for 684 samples is 2310.15, and randomly
    would exceed this value 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 85.0731 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.018269 (totally uncorrelated = 0.0).
I don't have the means to convert base64 to base52, but the result shouldn't be much different.
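(Converting isn't hard if you treat the bytes as one big integer. Rough Python sketch, using the 0-9 a-z A-P alphabet described upthread:)

```python
import os

B52 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"

def bytes_to_base52(data):
    # Treat the whole byte string as one big-endian integer,
    # then write it out in base 52.
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, r = divmod(n, 52)
        out.append(B52[r])
    return "".join(reversed(out)) or B52[0]

raw = os.urandom(512)        # same source as the dd example above
print(bytes_to_base52(raw))  # pipe this into ent for comparison
```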


Oh, I didn't expect this. Thank you!


That's why you should always test both what proves and what disproves your theory :)


I don't think anyone actually knows that. Which is exactly why it was inaccurate to describe it as "Base52": it made you think someone did know that.



