Well, we know that it was a test update that was released unintentionally. http://www.zdnet.com/article/microsoft-accidentally-issued-a...

So the question is why a test update would have "hidden" meaning underneath the random-looking strings.

As for the flaw in your methodology: your ent command (I'm not familiar with it, so I'm just basing this off what I see) assumes full use of the binary space (hence 127.5 as the mean of random data). No base52 data will use the full 256-value space, by definition. Random base52 (azAZ) data will actually have a mean of 93.5, extremely close to the measured mean. Serial correlation coefficient should also be higher for azAZ than 09azAP, because more of the alphabet is contiguous.
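A quick way to check the 93.5 figure, as a minimal sketch assuming the azAZ alphabet is plain ASCII a-z plus A-Z with every symbol equally likely (the variable names are mine, not from ent):

```python
import string

# Byte values of the 52-symbol alphabet a-z, A-Z (assumed ASCII encoding).
alphabet = (string.ascii_lowercase + string.ascii_uppercase).encode("ascii")

# Mean byte value when each of the 52 symbols is equally likely.
expected_mean = sum(alphabet) / len(alphabet)

# Mean byte value when all 256 byte values are equally likely.
full_byte_mean = sum(range(256)) / 256

print(expected_mean, full_byte_mean)
```

So a mean near 93.5 is what uniform random azAZ data should produce; comparing it against 127.5 says nothing about randomness.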

I did a little bit of analysis on the data as well to determine if the data was random or gibberish typed by a human on a keyboard, and found that most of the data lined up well for true (or pseudo) random. (Ugly code here: http://pastebin.com/9YN93xhi)

  Home row % expected: 34.6% 
  Home row % actual: 38.3%
  Expected upper: 50%
  Actual upper: 47.7%
  Expected sequential case match: 50%
  Actual sequential case match: 55.7%
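
For anyone who doesn't want to read the pastebin, the three measurements can be sketched roughly like this. This is not the linked code, just a simplified illustration; it assumes a QWERTY home row of "asdfghjkl" and a hypothetical helper name keyboard_stats:

```python
# Assumed QWERTY home-row letters (9 of the 26 letters).
HOME_ROW = set("asdfghjkl")

def keyboard_stats(s):
    """Return (home-row fraction, uppercase fraction, adjacent case-match fraction)."""
    letters = [c for c in s if c.isascii() and c.isalpha()]
    if len(letters) < 2:
        raise ValueError("need at least two letters")
    n = len(letters)
    home = sum(c.lower() in HOME_ROW for c in letters) / n
    upper = sum(c.isupper() for c in letters) / n
    # Fraction of adjacent pairs where both letters share the same case.
    pairs = list(zip(letters, letters[1:]))
    case_match = sum(a.isupper() == b.isupper() for a, b in pairs) / len(pairs)
    return home, upper, case_match

# Expected home-row share if all 26 letters are equally likely.
expected_home = len(HOME_ROW) / 26
print(round(expected_home * 100, 1))
```

For uniformly random azAZ data, the uppercase fraction and the adjacent case-match fraction should both hover around 0.5.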

Sure, I see the zdnet article. We could still play this game with any arbitrary string, though!

Regarding the assumption of full use of the binary space... May I phrase that differently? azAZ and 09azAP both leave big swaths of the space as always-zero or perhaps always-one. Another way to phrase it: you could write out an azAZ string using five-bit characters and have room to spare.

The swaths of emptiness in the range of possible values are why I took an English sample and did my best to encode it the same way the original was, so that we'd have an apples-to-apples comparison. The English sample had higher entropy than the original--after "accounting for" encoding.

You are right, my method of "accounting for" encoding did nothing for the "serial correlation coefficient" metric. I didn't know what that was until you mentioned it, thanks.

Your actual/expected analysis is a good idea, though it took me a minute to understand what you were doing. I guess: "If these are truly random digits from azAZ, then half of them will be uppercase." Indeed. But... I just read an article[0] which convinced me that I have no idea how "random" works. (Specifically this bit: "As an example, the probability of having exactly 100 cars is 0.056 — this perfectly balanced situation will only happen 1 in 18 times.")


0: https://www.quantamagazine.org/20150925-solution-the-road-le...
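
The 0.056 figure can be reproduced with a binomial calculation. This is a sketch under my reading of the setup, which is an assumption: 200 cars, each independently choosing one of two roads with probability 1/2.

```python
from math import comb

n_cars = 200  # assumed total number of cars

# Probability that exactly half of the cars end up on a given road.
p_exact = comb(n_cars, n_cars // 2) / 2 ** n_cars

print(round(p_exact, 3), round(1 / p_exact))
```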

Your understanding of base52 leaving large amounts of empty space is correct (you need 6 bits to fit it, though). That space throws off the "ent" measure. The 2.3 unused bits in every char dwarf the entropy difference between random data and encoded text. This would probably be easier to see if you'd added in known pseudorandom strings for more context. (I see kaoD provided a Base64 example.)
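
The arithmetic behind the 6 bits and the 2.3 unused bits, as a small sketch assuming each symbol is stored in one 8-bit byte and all 52 symbols are equally likely:

```python
import math

alphabet_size = 52  # a-z plus A-Z

# Information carried by one uniformly random symbol, in bits.
bits_of_information = math.log2(alphabet_size)

# Whole bits needed for a fixed-width code that can hold 52 symbols.
bits_needed = math.ceil(bits_of_information)

# Bits of an 8-bit byte that carry no information.
unused_bits = 8 - bits_of_information

print(round(bits_of_information, 2), bits_needed, round(unused_bits, 1))
```

Five bits only cover 32 symbols, which is why 52 symbols need six.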

Your understanding of "random" probably isn't that far off. After all, "the average number of cars on each road will be 100". When you look at random numbers they do the expected thing in aggregate. Individual samplings will vary, though. You could get all 200 cars on one road randomly. It's just exceedingly unlikely.
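
To put a number on "exceedingly unlikely", here is a rough sketch under the same assumed setup (200 cars, two roads, each car choosing independently with probability 1/2):

```python
n_cars = 200  # assumed total number of cars

# All cars on road A, or all cars on road B: two outcomes out of 2**200.
p_all_one_road = 2 * 0.5 ** n_cars

print(p_all_one_road)
```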

Cheers. :)

8 bits, 256 symbols.

7 bits, 128 symbols.

6 bits, -- oh. ... Time for coffee.