Binary to text encoding – state of the art and missed opportunities (volution.ro)
42 points by todsacerdoti on Feb 6, 2023 | 14 comments



> Setting computing history aside, some of the dinosaurs that still require them (like SMTP), and various esotericisms (like data: URI's), not many people use such encoding schemes.

Literally every user of git uses these schemes routinely. Hashes and UUIDs (and friends like ULID) are binary data and almost always presented textually.
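
A quick illustration (Python here just because it's handy): a SHA-1 digest or a UUID is raw bytes under the hood, and the string we actually read is a hex (base16) rendering of those bytes.

    import hashlib
    import uuid

    digest = hashlib.sha1(b"some content").digest()   # 20 raw bytes
    print(digest.hex())                               # the 40-char hex string we actually read

    u = uuid.uuid4()
    print(u.bytes.hex())                              # 16 raw bytes, rendered as hex
    print(str(u))                                     # the usual dashed-hex presentation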

> There are however still a few cases where they are the proper tool

Notably missing: allowing user input of binary data, and passing binary data through text-only channels (e.g. JSON does not support raw binary data, so b2t is necessary to send binary data through any JSON-based channel).
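
A minimal sketch of that last point (plain standard-library base64; the names are just for illustration):

    import base64
    import json

    payload = bytes(range(256))                         # arbitrary binary data
    msg = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

    roundtripped = base64.b64decode(json.loads(msg)["data"])
    assert roundtripped == payload                      # bytes survive the text-only channel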

> see the section about efficiency above; although this isn't strictly related to binary to text encoding, none of the current tools offer the option to transparently compress / decompress on encoding / decoding

Because compression is completely orthogonal, and there are many situations where you're better off leaving it out, e.g. compress, b2t, put it into JSON, and send that with DEFLATE transfer-encoding. The first compression is largely wasted; you'd almost certainly be better off only compressing the JSON.

And obviously, if you put a compression step after the b2t encoding, why did you b2t encode in the first place?
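
If you want to see how the layers interact for your own data, a rough sketch like this is enough (reading the script itself as a stand-in for real data; the numbers vary a lot with the input, which is exactly why the choice of layer shouldn't be baked into the encoder):

    import base64
    import zlib

    raw = open(__file__, "rb").read()                   # placeholder input: this script itself

    plain     = base64.b64encode(raw)                   # b2t only
    pre       = base64.b64encode(zlib.compress(raw))    # compress, then b2t
    transport = zlib.compress(plain)                    # b2t, then DEFLATE at the transport layer
    both      = zlib.compress(pre)                      # both compressions stacked

    print(len(plain), len(pre), len(transport), len(both))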


> Because compression is completely orthogonal

I agree. This was my thought throughout the article. Most of the features he wanted, like checksumming and compression, would be better provided at other layers. I think the reason that base64 is so popular is that it does one thing well. You can always add checksumming, but it often isn't needed.
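
For example, a checksum layer composes trivially on top of plain base64 when you do need it (a sketch, not a standard format):

    import base64
    import struct
    import zlib

    def encode_with_crc(data: bytes) -> bytes:
        # append a CRC32 to the payload, then base64 the whole thing
        return base64.b64encode(data + struct.pack(">I", zlib.crc32(data)))

    def decode_with_crc(text: bytes) -> bytes:
        raw = base64.b64decode(text)
        data, crc = raw[:-4], struct.unpack(">I", raw[-4:])[0]
        if zlib.crc32(data) != crc:
            raise ValueError("checksum mismatch")
        return data

    assert decode_with_crc(encode_with_crc(b"hello")) == b"hello"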


> (PGP words) -- (that I can't seem to find the proper name for) which is tailored to encoding small pieces of binary data and verify (or transmit) it verbally (in the original case via phones);

Interesting!

> They are analogous in purpose to the NATO phonetic alphabet used by pilots, except a longer list of words is used, each word corresponding to one of the 256 distinct numeric byte values.

https://en.wikipedia.org/wiki/PGP_word_list
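
The basic byte-to-word idea is tiny (made-up placeholder vocabulary below; the real PGP word list alternates two 256-word lists for even and odd byte positions so that transposed words are detectable):

    WORDS = [f"word{i:03d}" for i in range(256)]     # placeholder, not the PGP list
    INDEX = {w: i for i, w in enumerate(WORDS)}

    def to_words(data: bytes) -> str:
        return " ".join(WORDS[b] for b in data)

    def from_words(text: str) -> bytes:
        return bytes(INDEX[w] for w in text.split())

    assert from_words(to_words(b"\xde\xad\xbe\xef")) == b"\xde\xad\xbe\xef"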

I have a project like this that I created once, but using a different set of words.

https://github.com/ctsrc/Base256

I'm gonna go ahead and replace the word list there with the PGP word list actually! :D (It's still pre-1.0, so breaking changes like this are fine.)


The article doesn't mention yEnc, which is (or was?) very popular for binary attachments on Usenet.

From Wikipedia (https://en.wikipedia.org/wiki/YEnc):

> yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. It reduces the overhead over previous US-ASCII-based encoding methods by using an 8-bit encoding method. yEnc's overhead is often (if each byte value appears approximately with the same frequency on average) as little as 1–2%, compared to 33–40% overhead for 6-bit encoding methods like uuencode and Base64.
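
The core transformation is simple; roughly this (ignoring line wrapping, the =ybegin/=yend framing, and the CRC trailer):

    def yenc_encode(data: bytes) -> bytes:
        out = bytearray()
        for b in data:
            o = (b + 42) % 256
            if o in (0x00, 0x0A, 0x0D, 0x3D):    # NUL, LF, CR, '=' would break transport
                out.append(0x3D)                 # escape marker '='
                o = (o + 64) % 256
            out.append(o)
        return bytes(out)

    print(len(yenc_encode(bytes(range(256)))))   # 260: only 4 of 256 byte values needed escaping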


It is not really binary-to-text though; it is more "binary to binary which can pass through common NNTP servers/clients as long as common encodings like latin1 are used".

The last part is critical - yEnc is incompatible with more complex encodings like UTF-8, which is why it is completely useless today.


OK, so. Last weekend I was faffing about with trs80gp, trying to get Xenix going on an emulated TRS-80 Model 16, which I did. But there was some software that I wanted to install that only existed in dd'ed copies off floppy disks, which trs80gp didn't recognize as disk images. What to do, what to do.

A cursory check revealed that Xenix had uuencode and uudecode! So I uuencoded the disk images on the host system, pasted them into the emulator with its ability to type out entire files at the virtual keyboard, then uudecoded them in the Xenix guest and used Xenix to blit them onto fresh floppy images. Done! I had install disks.

I'm sure that Alaithea Moondreemur, the striped-socks-wearing pink-haired unicorn fox, and hir headmates will be along by and by to tell us that uuencode/uudecode are obsolete, insecure, and should not be included in modern distros because, news flash, it's not the 80s anymore and modern applications use 8-bit-clean binary protocols anyway. But this opportunity to get my 2023 Linux to talk to the first Unix I ever used... man, it was a trip!


A seemingly simple question which will cause you no end of misery if the answer is not the easy one: is a piece of UTF-8 "binary" or "text"?


UTF-8 is both "binary" and "text" at the same time. :)

It's binary because the UTF-8 standard states how each Unicode code-point (i.e. character) is to be translated into a series of bytes.

But, because each (correct) UTF-8 byte sequence can be translated back into a Unicode code-point sequence, you can see it as text also. :)
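
In other words (a trivial Python round trip):

    s = "naïve ☃"
    b = s.encode("utf-8")          # the binary view: the exact bytes on the wire
    print(list(b))
    assert b.decode("utf-8") == s  # the text view: back to the same code points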

(BTW, to my knowledge, UTF-8 doesn't specify a canonical encoding of Unicode text, so for cryptographic purposes, especially signatures, one should perhaps treat it with care.)


AFAIK, overlong encodings are not valid UTF-8 [0], which means the canonical encoding of each Unicode code point is clearly defined, as there is only one valid encoding. That still leaves higher-level Unicode shenanigans, but every Unicode encoding is going to have those. And of course, what applications accept in practice is an entirely different issue.

[0] https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
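
For example, 0xC0 0xAF would be an overlong two-byte encoding of '/', and a conforming decoder (CPython's included) rejects it:

    print("/".encode("utf-8"))           # b'/': the single valid encoding

    try:
        b"\xc0\xaf".decode("utf-8")      # overlong form of '/'
    except UnicodeDecodeError as e:
        print("rejected:", e)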


While the point on overlong encodings is true, there are still multiple ways to do composition[1]: brown skin + high-five vs. high-five + brown skin, etc.

That's part of why it's a good idea to normalize input before password hashing, for example (sketched below). It will likely become more common over time to use emoji as passphrase input.

1. https://en.wikipedia.org/wiki/Unicode_equivalence
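
A small sketch of that point ("é" precomposed vs. decomposed; sha256 only to keep it short, use a real KDF for actual password storage):

    import hashlib
    import unicodedata

    a = "caf\u00e9"        # precomposed é (U+00E9)
    b = "cafe\u0301"       # e + combining acute (U+0301)
    assert a != b          # different code points, same rendered text

    def digest(pw: str) -> str:
        return hashlib.sha256(unicodedata.normalize("NFC", pw).encode("utf-8")).hexdigest()

    assert digest(a) == digest(b)   # normalization makes them hash the same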


Yes, composing characters are what I was referencing with "higher level Unicode shenanigans". This doesn't stop there though - many people would say that "а" and "a" are encodings of the same character even if Unicode thinks otherwise. All that is above the concerns of UTF-8 though, which only cares about encoding code points into byte sequences.


If your encoding is anything other than single byte characters with an upper bound of 0x7F, then indeed - you're probably using a binary encoding that is often rendered as 'text'. And it is going to hurt.


I somehow assumed this was going to be about using language models to compile text/language directly to executable code


As someone who has dealt extensively with binary-to-text encoding, I'm very glad that an algorithm with the features you're proposing would get zero adoption, and that I'll never have to deal with it.



