Base64, encoded as either ASCII or UTF-8, stores 6 bits per byte, making it 75% efficient.
Base65536 encoded as UTF-8 (by far the most common case) splits its characters about evenly between 3-byte and 4-byte encodings (it uses code points up to 165376, roughly evenly distributed). That's 16 bits per 3.5 bytes, or 57% efficient. Much worse.
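The arithmetic behind those two percentages is easy to check directly (this just restates the comment's averaging assumption, that Base65536 code points split evenly between 3- and 4-byte UTF-8 sequences):

```python
# Payload bits carried per byte of encoded output.
base64_eff = 6 / 8              # each ASCII byte carries 6 payload bits
# Base65536: 16 payload bits per code point, at ~3.5 UTF-8 bytes each
# (half the code points take 3 bytes, half take 4).
base65536_eff = 16 / (3.5 * 8)

print(f"{base64_eff:.0%}")      # 75%
print(f"{base65536_eff:.1%}")   # 57.1%
```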
Stick with Base64 unless you're strictly counting code points (like tweets do).
That is precisely the author's metric. In the "why?" section they write:
"I wanted people to be able to share HATETRIS replays via Twitter. Twitter supports tweets of up to 140 characters."
Not quite! :) I wrote a similar tool I called baseunicode. (Glad I'm not the only one who's discovered this!) The author here is using it for tweets, which oddly don't count bytes but code points; I wrote my version for copy/pasting data between terminals. Copying more than a screen's worth is a hassle, so "screen area" is the metric I'm trying to optimize, which ends up being subtly different from the article's metric of code points per tweet. (For screen area in a terminal, some characters, CJK in particular, take up multiple columns, and combining code points (like an acute accent) are basically free!)
In my case, it ended up being easier to use base64, because it's installed by default.
SCP is just a pain sometimes. If you need to read a file only readable by root, AFAIK you're out of luck. I'm usually going server-to-server, so I also need to remember `-3`, which I inevitably forget. You can get around the root issue with `ssh source 'sudo cat /foo/bar' | ssh destination 'cat >foo'` (or `sudo bash -c "cat >foo"` if you need root on the write…), but that's more work. Often I find I'm already there in both terminals and I just want these bytes from this split in that other split, so I started doing `tar -cz <stuff> | base64`, copy, `base64 -d | tar -xz`, paste. And then screen size mattered, thus baseunicode.
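The whole round trip can be demoed in one script (in real use, the base64 text is what you copy from terminal A and paste into terminal B; `demo`, `restored`, and `payload.txt` are scratch names for this sketch, and `base64 -d` is the GNU spelling, `-D` on some BSDs):

```shell
mkdir -p demo restored
echo "hello" > demo/file.txt

tar -cz demo | base64 > payload.txt            # terminal A: copy this text
base64 -d < payload.txt | tar -C restored -xz  # terminal B: paste it here

diff demo/file.txt restored/demo/file.txt && echo OK   # prints OK
```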
Not so odd, but very sensible! I'm always saddened by this oversight in SMS messages: they're 160 7-bit characters, so 140 octets. But those 7-bit characters come from the GSM charset. If you want to write in languages other than Western European ones, your message gets encoded as UTF-16, and now you only have 70 characters. That might be OK for Chinese, but it's absolutely unusable for Cyrillic.
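The packing arithmetic there checks out: 160 seven-bit GSM characters occupy exactly the 140 octets that hold only 70 UTF-16 code units.

```python
# 160 GSM septets packed into octets:
octets = 160 * 7 // 8        # 140 bytes per SMS
# The same 140 bytes as 2-byte UTF-16 code units:
utf16_chars = octets // 2    # 70 characters

print(octets, utf16_chars)   # 140 70
```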
Of course, the limiting factor in SMS is architectural, whereas in Twitter, it's philosophical, so there's your difference.
Support for a compact representation of more languages would have been nice: something like an 8-bit mode with 2-byte escapes for setting the code page and 3-byte escapes for arbitrary Unicode characters. As a bonus, you'd actually be able to fit more emoji in that mode: 46 vs. 35.
A little effect in 256 bytes of binary code. Some of them are amazing.
Most tweets I make are pure ASCII, which is 7 bit.
(280 * 8) / 7 = 320 septets
Well, 160 septets is the non-Unicode maximum length of an SMS.
So a tweet can fit two SMS messages in it!
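Following that comment's own arithmetic (treating the 280 ASCII characters as 280 octets repacked into septets):

```python
# 280 ASCII bytes repacked as 7-bit septets:
tweet_septets = 280 * 8 // 7    # 320 septets
sms_septets = 160               # non-Unicode SMS maximum

print(tweet_septets // sms_septets)   # 2 full SMS messages per tweet
```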
I bet normal people would use them too if they knew about them and were given a chance, but they can't, because Twitter doesn't allow code points that aren't used by normal people.
As long as Twitter allows Zalgo it isn't that concerned about website layouts.
(Zalgo: Rampant abuse of stacking combining diacritical marks above and below the line. It can get really, really... uh... impressive, in a Lovecraftian typewriter art kind of way.)
Base32k is a similar idea, but it stores exactly 15 bits per code point. It uses three ranges of Unicode, going for simplicity.
Edit: Ah, I see, it uses 65536 + 256 code points across two planes to signal different numbers of bytes. That makes it simple in its own way, but it also wastes a significant amount of its code space.
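The core "15 bits per code point" idea can be sketched in a few lines. This is illustrative only: the `BASE` offset is made up (the real Base32k ranges differ), and it sidesteps the final-byte-count ambiguity, which the real scheme solves with its second plane, by passing the byte length to the decoder explicitly.

```python
BASE = 0x5000  # placeholder offset, NOT the real Base32k code-point range

def encode(data: bytes) -> str:
    """Pack the input into 15-bit groups, one code point per group."""
    bits = ''.join(f'{b:08b}' for b in data)
    bits += '0' * (-len(bits) % 15)        # zero-pad to a multiple of 15
    return ''.join(chr(BASE + int(bits[i:i+15], 2))
                   for i in range(0, len(bits), 15))

def decode(text: str, nbytes: int) -> bytes:
    """Reverse the packing; nbytes disambiguates the zero padding."""
    bits = ''.join(f'{ord(c) - BASE:015b}' for c in text)
    return bytes(int(bits[i:i+8], 2) for i in range(0, nbytes * 8, 8))
```

For example, `encode(b"hello")` is 3 code points (40 data bits plus 5 padding bits), and `decode(encode(b"hello"), 5)` round-trips back to `b"hello"`.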
Well, Base64 isn't actually 64 either.
Also, on a pedantic note, kanji are Japanese characters. Chinese characters are hanzi.
For most characters, they're the same, but there are a few differences.
As for the pedantic note, kanji is an English word whereas hanzi isn't, so that's why I used it.
The very existence of binaries on usenet still makes me chuckle. It's such a strange medium for binaries, and yet it was so prolific.
So the only real surprise is that it isn't everything's preferred loose-coupling substrate, but we do love reinventing the wheel.
This is an attempt at encoding images in a tweet, by using Unicode.
If (a) is not true, you are better off just using binary as you state.
If (b) is not true, you are better off using Base64 if the underlying encoding is ASCII-compatible (e.g. ASCII, ISO-8859, or UTF-8), or Base4096 if the underlying encoding is UTF-16.
In the case of tweets, both (a) and (b) are true, making this somewhat worthwhile.