
"Base52-encoded random numbers" is a rather obtuse way to describe random letters.



Base52 means capital letters and lower case letters. Base62 includes numbers (0 through 9).

We programmers like being specific. Sometimes these sorts of details matter.


And saying "base52" is misleading. It implies that there is a source data that's been encoded, which is not likely here. You can be specific without implying that.

A random string out of a specific character set is also subtly different from taking random bits and encoding them with that same character set, in that the first couple digits will have different distributions.
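
(A rough sketch of that difference, for the curious. The a-zA-Z alphabet, the 128-bit input size, and Python are all illustrative assumptions here, not anything known about the update:)

    import secrets
    from collections import Counter

    ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # one possible "base52" alphabet

    def random_letters(n=23):
        # Method 1: pick every character uniformly at random.
        return "".join(secrets.choice(ALPHABET) for _ in range(n))

    def encode_random_bits(bits=128):
        # Method 2: draw random bits and base-52-encode the resulting integer.
        value = secrets.randbits(bits)
        out = ""
        while value:
            value, rem = divmod(value, 52)
            out = ALPHABET[rem] + out
        return out or ALPHABET[0]

    # The leading character gives the two methods away: uniform choice spreads
    # it evenly over all 52 letters (~1.9% each), while a 128-bit number
    # squeezed into at most 23 base-52 digits usually starts with one of the
    # first few symbols of the alphabet (~17% each for the top ones).
    trials = 20000
    first_uniform = Counter(random_letters()[0] for _ in range(trials))
    first_encoded = Counter(encode_random_bits()[0] for _ in range(trials))
    print("uniform choice, top 3:", [(c, n / trials) for c, n in first_uniform.most_common(3)])
    print("encoded bits,   top 3:", [(c, n / trials) for c, n in first_encoded.most_common(3)])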


> saying "base52" is misleading. It implies that there is a source data that's been encoded

No, it doesn't. 0xFF is a number I just made up, no source data at all, I promise. Also, it's base 16 :)

Anyhow, the source data was most definitely base 2 (as is your computer's memory, I assume) and later encoded into base52 to be represented as a string (unless someone at Microsoft wrote it in base52, which seems unlikely).


> 0xFF is a number I just made up, no source data at all, I promise. Also, it's base 16.

It's not base 16 encoded, which was his point. Encoding demands a source. This is just a base-16 number unless you encoded something to arrive at this. You could interpret "Romeo and Juliet" as a very large base65 number (65 unique chars in the random copy I grabbed) if you want, but it's not meaningful or accurate to call it a base65 encoding.

> Even if that were true, the source data was most definitely encoded from base 2 (which is what our computers work with).

This is the kind of pedantry that people hate because it adds nothing to the conversation. It's a way to inject "I'm right" moments into the conversation so you can feel smart, while no one else really cares. It makes for unpleasant conversations.

<pedantry>

You're also not right. Your brain doesn't work in base 2, and you likely didn't enter this number into your computer in base 2 either. You typed in the string "0xFF", and that string was encoded in base 2. The base-2 representation of the string "0xFF" is very different from the base-2 representation of the logical (base-16) number 0xFF (see the sketch below).

</pedantry>
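
(To make the pedantry concrete, a tiny sketch; Python is just standing in for "your computer" here:)

    # The four-character string "0xFF" as stored bytes vs. the 8-bit number 0xFF.
    as_string = "0xFF".encode("ascii")              # bytes 0x30 0x78 0x46 0x46
    as_number = 0xFF                                # the value 255
    print(" ".join(f"{b:08b}" for b in as_string))  # 00110000 01111000 01000110 01000110
    print(f"{as_number:08b}")                       # 11111111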


> It's not base 16 encoded, which was his point.

I didn't understand it that way.

Also, your whole <pedantry> block and the paragraph above are based on a misunderstanding of my comment (probably because I'm not a native speaker and you caught me in between edits).

I think you're the only one making an "unpleasant conversation".


> No, it wasn't (or I didn't understand it that way). He'd have said encoded somewhere.

The original comment did say "encoded". The discussion about the phrase "base52 encoding" was the base of this entire thread. The parent of your original response also used the term "encoded". The context is clear. I don't see how you could have missed it. (Edit: I see you're not a native speaker. That might be part of why we're not understanding each other. Plus I apparently keep replying in between your edits, which happened again.)

> Also, your whole <pedantry> block and the paragraph above is based on misunderstanding my comment

Well, you rewrote the comment after I replied. I assumed your "source data" was your logical 0xFF number. If you were referring to the "source data" for the strings in the update description, then in all likelihood, there was never a "source" number at all. These strings were almost certainly generated via random selection from a set of characters. You could generate a very long number and then base52-encode it to produce the same thing, but it would be more work and less obvious for future code maintainers. So the "source" was a sequence of characters (azAZ), not a base2 number. You could argue that this is still somehow base-2 since it's in a computer, which I guess is fine (if pointless and pedantic), but it's still not accurate to say that these were "encoded" into base52.
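
(The "more obvious for future code maintainers" route really is a one-liner. This is only a guess at how it might have been done, not anything known:)

    import secrets
    import string

    # Pick 107 letters uniformly at random, roughly matching the strings quoted elsewhere in the thread.
    token = "".join(secrets.choice(string.ascii_letters) for _ in range(107))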

I was unnecessarily snarky and rude. I'm sorry for that.


You're right, but the "base52" part isn't what you're right about; it's the "encoded" part. "base52" is accurate (it is a series of base52 characters), but it doesn't appear to be the product of an encoding.


Base52 is still not really accurate. Base52 implies an encoding. Further, Base52 is not a standard so it's not even meaningful to say that the characters are from the Base52 set. You could also represent Base52 by including 0-9 and excluding Q-Z. Any string that "looks like" Base52 (azAZ) also looks like Base64 and any number of other encodings.


>No, it doesn't. 0xFF is a number I just made up, no source data at all, I promise. Also, it's base 16 :)

But the full quote was "base52-encoded". (Though I would argue that base52 implies encoding, because nothing naturally works in base52. The only thing that's naturally 52 is "random letters with random case". Or something with cards.)

>Anyhow, the source data was most definitely base 2 (as is your computer's memory, I assume) and later encoded into base52 to be represented as a string (unless someone at Microsoft wrote it in base52, which seems unlikely).

That is an enormous assumption. It's easier to pick random letters than it is to take a specific binary number and convert it to letters. And they don't give you the same result. Bits stored in base 52 will never start with zzzzz.
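
(For a concrete instance: 52^22 ≈ 2^125.4 and 52^23 ≈ 2^131.1, so a random 128-bit value written in base 52 takes at most 23 digits, and whenever it uses all 23 its leading digit can be no higher than floor(2^128 / 52^22) = 6, i.e. the seventh symbol of whatever alphabet you picked. Uniformly chosen letters, by contrast, start with any of the 52 equally often. The 128-bit size is only an example.)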


This detail doesn't matter and is needlessly confusing. "Random upper and lowercase letters" is exactly as specific and accurate as "base52-encoded random numbers", but the former is more understandable while the latter is trying way too hard to sound smart.


look, this is hacker news. the author knows his audience. most people who read this will know what base52 is or will at least recognize what it might be and be able to look it up. i don't think it's meant to sound "smart"; it's an accurate and concise description of the randomness


It wasn't smart since I literally thought he was saying there was encoded data.


These details can often shed light on what happened (or was supposed to happen), what tools were being used, etc. While they might be useless for the average reader, there are many on HN, myself included, who will be dissecting the update to learn more about how Windows Update works.


That's fine, but it's not useful to use obtuse terminology. Not once have I heard anyone use the term base-52 before today. I understood the term, but it's not common (because base-52 encoding is not common). It's so not common that it doesn't merit a page on Wikipedia, nor a reference from the page for the number 52, nor even a reference from the page for base-64. It's obtuse.

It's also so specific as to be inaccurate. These strings could be interpreted as "base-52", but also as base-64 or any other base greater than 52. Calling them "encoded" also implies a belief about how these were derived that isn't justified. "Encoded" means that there is some original source that can be recreated by decoding. It's possible that these were actually created by generating random numbers and then encoding that data in base-52. I think that's pretty unlikely, though.

So no, I don't think this sheds any light or adds any detail. It's inaccurate and misleading, and if we're going to be so specific that we're making up terminology, then we should also be specific enough to say things like "assumed pseudorandom" rather than "random" when we don't know. Otherwise we're just being obtuse.


I have to agree --- apart from anything else, 'base-52' doesn't make it clear that the encoding alphabet is, in fact, made up of letters. It'd be just as valid to use 0-9, A-Z and a-p.

'base-64' is at least a defined term with a defined alphabet.


> We programmers like being specific.

But above all, we like being pedantic. (Not you.) :)


[a-Z] and [a-Z0-9] would be a better representation, no?


[flagged]


> Don't come to HN if you're expecting everything to be in layman's terms

That's not nice. Please don't do that.

I doubt that dpark is "expecting everything to be in layman's terms", but even if dpark were, many HN users would be happy to explain. And that's the kind of site we want.


Then why describe being pedantic (extremely specific) as being "obtuse" (annoyingly insensitive)?


Because the specificity here is inaccurate and misleading. I've left plenty of comments about what's wrong with the term "base52-encoded random numbers" here and so have a number of other commenters. I didn't ask for anyone to use lay terminology, but it's reasonable to ask for clear and correct terminology.

One of the frustrating things about reading academic papers is that many authors insist on using less common and less clear phrasing when there are much simpler ways of saying the same things. It makes for unpleasant, heavy reading and it makes the author sound pompous.

"Obtuse" also means dumb, and is often used to mean "difficult to understand". "Abstruse" might have been a better choice of word.


It's not random letters. It's random numbers encoded in base52. Because the number strings are encoded, they're probably not random at all.


You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them.


    hobbes@namagiri:~/scratch$ echo gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarmPJLOZYMEILGNIPwNOgEazuBVJcyVjBRL|ent
    Entropy = 5.352821 bits per byte.
    
    Optimum compression would reduce the size
    of this 108 byte file by 33 percent.
    
    Chi square distribution for 108 samples is 650.52, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 92.6019 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.142915 (totally uncorrelated = 0.0).
    
...Which is meaningless without context. Let's see if we can create some context...

    hobbes@namagiri:~/scratch$ echo "You have no way of knowing that. All you see are some random-looking strings. These could be numbers (random or not) that were encoded as base52. These could also be numbers encoded as base64, or base88, or any other base52+. Or they could be randomly-generated strings and there is no meaningful number underlying them." |base64 |sed -e :a -e '$!N; s/\n//; ta'
    WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K
ok, that's base64. Let's turn that into base52...

    CL-USER> (to-base 52 (from-base 64 "WW91IGhhdmUgbm8gd2F5IG9mIGtub3dpbmcgdGhhdC4gQWxsIHlvdSBzZWUgYXJlIHNvbWUgcmFuZG9tLWxvb2tpbmcgc3RyaW5ncy4gVGhlc2UgY291bGQgYmUgbnVtYmVycyAocmFuZG9tIG9yIG5vdCkgdGhhdCB3ZXJlIGVuY29kZWQgYXMgYmFzZTUyLiBUaGVzZSBjb3VsZCBhbHNvIGJlIG51bWJlcnMgZW5jb2RlZCBhcyBiYXNlNjQsIG9yIGJhc2U4OCwgb3IgYW55IG90aGVyIGJhc2U1MisuIE9yIHRoZXkgY291bGQgYmUgcmFuZG9tbHktZ2VuZXJhdGVkIHN0cmluZ3MgYW5kIHRoZXJlIGlzIG5vIG1lYW5pbmdmdWwgbnVtYmVyIHVuZGVybHlpbmcgdGhlbS4K"))
    "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO"
Astute readers will note that this output contains numeric digits. There's more than one representation of "base-52". The one above goes from 0 through p. An alternative would go from A-z. But none of that matters to measure the entropy of base-52 encoded English text, ya dig?

    hobbes@namagiri:~/scratch$ echo "6lH81qbtcGjfFOclhdtt7ieuljADG17Ou6swolu2A2qlnO52zmMkG8Nfk2APbEbso4idFF44wsLIGfrOg5atP0ucg5gubxAcD9ztMcboeC4sAui28skbwtEiuv64OuD8fbFnn1M01oeq3bH7n9bvu8p3P1MwirdDHxKDONktDvtNOLE1srOz2I4wNLsBpgOGIlLs1i11xt58JpOC1whJ54Krmln1ahmrvksODe7kqjtBazKzKamu5tygI6hGHq3h123Ighyw8s2MxE4dl9rBdBNG1o7tJM2HvzOh955NpLB33nuPr4OwfhjB618y1BP9y2euMquIMszOuH1rPAEkBccOu9qIJBqKgGf1qH42bBd7GOMsbKExosd8CErJlMIAcEyyytFzoKOfkJy1ExEnKc0iE7OGkpLJM3ybB2JNlnwtrC5F0cH8kB41IIv0BwNk9DO" | ent
    Entropy = 5.610518 bits per byte.
    
    Optimum compression would reduce the size
    of this 452 byte file by 29 percent.
    
    Chi square distribution for 452 samples is 2093.27, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 86.3496 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is 0.018253 (totally uncorrelated = 0.0).
Conclusion: I don't think the original text was pseudorandom. If you can convince me that the original isn't "base-52 encoded" at all (and I'd be easily convinced), I'm willing to reevaluate that conclusion. I'm also interested to see if anyone sees a flaw in my process.
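
(Aside: to-base / from-base aren't standard Common Lisp, presumably the poster's own helpers. For anyone following along, here's a rough Python stand-in; the exact alphabets are my guesses, read off the output above:)

    # Rough stand-ins for the to-base / from-base helpers used above. The
    # alphabets are assumptions: standard base64 ordering, and "0 through p"
    # read as 0-9, a-z, then A-P for the 52 symbols.
    B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    B52 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"

    def from_base(alphabet, s):
        # Read s as one big positional number in the given alphabet.
        n = 0
        for ch in s:
            n = n * len(alphabet) + alphabet.index(ch)
        return n

    def to_base(alphabet, n):
        # Write n back out as digits of the given alphabet.
        digits = ""
        while n:
            n, r = divmod(n, len(alphabet))
            digits = alphabet[r] + digits
        return digits or alphabet[0]

    print(to_base(B52, from_base(B64, "SGFja2VyIE5ld3M")))  # any base64-looking string will do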


Well, we know that it was a test update that was released unintentionally. http://www.zdnet.com/article/microsoft-accidentally-issued-a...

So the question is why a test update would have "hidden" meaning underneath the random-looking strings.

As for the flaw in your methodology, your ent command (not familiar with this, so just basing this off what I see) is assuming full use of the binary space (hence assuming 127.5 as the mean of random data). No base52 data will use the full 256-value space, by definition. Base52 (azAZ) will actually have a mean of 93.5 for random data, extremely close to the measured mean. Serial correlation coefficient should also be higher for azAZ than 09azAP, because more of the alphabet is contiguous.
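
(Quick check on that 93.5, assuming plain ASCII byte values:)

    # Mean byte value of a uniformly random a-zA-Z stream (ASCII).
    vals = list(range(ord("A"), ord("Z") + 1)) + list(range(ord("a"), ord("z") + 1))
    print(sum(vals) / len(vals))  # 93.5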

I did a little bit of analysis on the data as well to determine if the data was random or gibberish typed by a human on a keyboard, and found that most of the data lined up well for true (or pseudo) random. (Ugly code here: http://pastebin.com/9YN93xhi)

  Home row % expected: 34.6% 
  Home row % actual: 38.3%
  Expected upper: 50%
  Actual upper: 47.7%
  Expected sequential case match: 50%
  Actual sequential case match: 55.7%
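
(A compact sketch of those three checks, assuming the nine QWERTY home-row letters; the real code is in the pastebin above, which presumably ran over all the strings in the update:)

    # Rough sketch of the three checks above; the real analysis is in the pastebin.
    HOME_ROW = set("asdfghjkl")  # assumed QWERTY home row

    def stats(s):
        letters = [c for c in s if c.isalpha()]
        home = sum(c.lower() in HOME_ROW for c in letters) / len(letters)
        upper = sum(c.isupper() for c in letters) / len(letters)
        same_case = sum(a.isupper() == b.isupper()
                        for a, b in zip(letters, letters[1:])) / (len(letters) - 1)
        return home, upper, same_case

    # Expected for uniform a-zA-Z: home row 9/26 = 34.6%, upper 50%, sequential case match 50%.
    sample = "gYxseNjwafVPfgsoHnzLblmmAxZUiOnGcchqEAEwjyxwjUIfpXfJQcdLapTmFaqHGCFsdvpLarmPJLOZYMEILGNIPwNOgEazuBVJcyVjBRL"
    print(stats(sample))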


Sure, I see the zdnet article. We could still play this game with any arbitrary string, though!

Regarding the assumption of full use of the binary space... May I phrase that differently? azAZ and 09azAP both leave big swaths of the space as always-zero or perhaps always-one. Another way to phrase it: you could write out an azAZ string using five-bit characters and have room to spare.

The swaths of emptiness in the range of possible values is why I took an English sample and did my best to encode it the same way as the original was, so that we'd have an apples-to-apples comparison. The English sample had higher entropy than the original--after "accounting for" encoding.

You are right, my method of "accounting for" encoding did nothing for the "serial correlation coefficient" metric. I didn't know what that was until you mentioned it, thanks.

Your actual/expected analysis is a good idea, though it took me a minute to understand what you were doing. I guess: "If these are truly random digits from azAZ, then half of them will be uppercase." Indeed. But... I just read an article[0] which convinced me that I have no idea how "random" works. (Specifically this bit: "As an example, the probability of having exactly 100 cars is 0.056 — this perfectly balanced situation will only happen 1 in 18 times.")
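
(That 0.056 figure is the plain binomial for 200 cars each picking one of two roads at random, which is easy to check:)

    # P(exactly 100 of 200 cars pick road A), each car choosing 50/50.
    from math import comb
    p = comb(200, 100) * 0.5 ** 200
    print(p, 1 / p)  # ~0.0563, i.e. about 1 in 18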

Cheers.

0: https://www.quantamagazine.org/20150925-solution-the-road-le...


Your understanding of base52 leaving large amounts of empty space is correct (you need 6 bits to fit it, though). That space throws off the "ent" measure. The 2.3 unused bits in every char dwarfs the entropy difference between random data and encoded text. This would probably be easier to see if you'd added in known pseudorandom strings for more context. (I see kaoD provided a Base64 example.)

Your understanding of "random" probably isn't that far off. After all, "the average number of cars on each road will be 100". When you look at random numbers they do the expected thing in aggregate. Individual samplings will vary, though. You could get all 200 cars on one road randomly. It's just exceedingly unlikely.

Cheers. :)


8 bits, 256 symbols.

7 bits, 128 symbols.

6 bits, -- oh. ... Time for coffee.


    $ dd if=/dev/urandom bs=512 count=1 2>/dev/null | base64 -w 0 | ent
    Entropy = 5.933850 bits per byte.
    
    Optimum compression would reduce the size
    of this 684 byte file by 25 percent.

    Chi square distribution for 684 samples is 2310.15, and randomly
    would exceed this value 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 85.0731 (127.5 = random).
    Monte Carlo value for Pi is 4.000000000 (error 27.32 percent).
    Serial correlation coefficient is -0.018269 (totally uncorrelated = 0.0).
I don't have the means to convert base64 to 52 but the result shouldn't be much different.
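
(For what it's worth, the ceilings line up: if I'm reading ent right, its headline figure is just the per-byte Shannon entropy of the byte frequencies, which for an ideal stream over k distinct byte values tops out at log2(k).)

    from math import log2
    print(log2(52))      # 5.70 -- best any a-zA-Z-only stream can reach
    print(log2(64))      # 6.00 -- best any base64 output can reach
    print(8 - log2(52))  # ~2.30 "unused" bits per char, the figure mentioned elsewhere in the thread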


Oh, I didn't expect this. Thank you!


That's why you should always test both what proves and what disproves your theory :)


I don't think anyone actually knows that. Which is exactly why it was inaccurate to describe it as "Base52": it made you think someone did know that.



