
Base65536 – Unicode's answer to Base64 - tomjakubowski
https://www.npmjs.com/package/base65536
======
colanderman
This only makes sense if your metric is number of code points. You don't
actually get better density with this.

Base64, encoded as either ASCII or UTF-8, stores 6 bits per byte, making it
75% efficient.

Base65536 encoded as UTF-8 (by far the most common) will split its characters
about evenly between 3-byte and 4-byte encodings (it uses code points up to
165376 roughly evenly distributed [1]). That is 16 bits per 3.5 bytes, or 57%
efficient. Much worse.

UTF-16 (what JavaScript uses) is slightly better; you get a distribution about
even between 2-byte and 4-byte encodings. That is 16 bits per 3 bytes, or 67%
efficient.
Better, but not as good as Base64 in ASCII or UTF-8. (However, it _is_ better
than Base64 in UTF-16, which is only 37.5% efficient.)
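For anyone who wants to check the figures, the arithmetic works out like this (assuming the even splits described above):

```python
# Bits of payload per byte of encoded output, for each scheme.

# Base64: 6 bits per character; each character is 1 byte in ASCII/UTF-8.
base64_utf8 = 6 / 1        # 6.0 bits/byte -> 75% of 8

# Base65536 in UTF-8: 16 bits per code point, code points averaging
# ~3.5 bytes each (even split between 3- and 4-byte sequences).
base65536_utf8 = 16 / 3.5  # ~4.57 bits/byte -> ~57%

# Base65536 in UTF-16: ~even split between 2- and 4-byte code points.
base65536_utf16 = 16 / 3   # ~5.33 bits/byte -> ~67%

# Base64 in UTF-16: 6 bits per 2-byte character.
base64_utf16 = 6 / 2       # 3.0 bits/byte -> 37.5%

for name, bits in [("base64/utf8", base64_utf8),
                   ("base65536/utf8", base65536_utf8),
                   ("base65536/utf16", base65536_utf16),
                   ("base64/utf16", base64_utf16)]:
    print(f"{name}: {bits / 8:.1%} efficient")
```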

Stick with Base64 unless you're strictly counting code points (like tweets
do).

[1] [https://github.com/ferno/base65536/blob/master/get-block-start.json](https://github.com/ferno/base65536/blob/master/get-block-start.json)

~~~
deathanatos
> This only makes sense if your metric is number of code points.

Not quite! :) I wrote a similar tool I called baseunicode[1]. (Glad I'm not
the only one who's discovered this!) The author here is using it for tweets
(which oddly don't count bytes; they count code points); I wrote my version
for copy/pasting data between terminals[2]: copying more than a screen's worth
is a hassle, so "screen area" is the metric I'm trying to optimize, which ends
up being subtly different from the article's metric of code points per tweet.
(For screen area in a terminal, some characters — CJK in particular — end up
taking multiple columns, and combining code points (like acute accents) are
basically free!)

In my case, it ended up being easier to use base64, because it's installed by
default.

[2]: SCP is just a pain sometimes. If you need to read a file only readable
by root, AFAIK, you're out of luck. I'm usually going server-to-server, so I
also need to remember `-3`, which I inevitably forget. You can get around the
root issue with `ssh source 'sudo cat /foo/bar' | ssh destination 'cat >foo'`
(or `'sudo bash -c "cat >foo"'` if you need root on the write…), but it's more
work; often I find I'm already there in both terminals and I just want these
bytes from this split in this other split, so I started doing `tar -cz <stuff> |
base64`, copy, `base64 -d | tar -xz`, paste. And then screen size mattered,
thus baseunicode.

[1]:
[https://github.com/thanatos/baseunicode](https://github.com/thanatos/baseunicode)

~~~
adimitrov
> which oddly don't count bytes, they count code points

Not so odd, but very sensible! I'm always saddened by this oversight in SMS
messages — they're 160 7-bit characters, so 140 octets. But those 7-bit
characters are the GSM charset. If you want to write in languages other than
Western European ones, your message will be encoded using UTF-16 — now you
only have 70 characters. That might be OK for Chinese, but it's absolutely
unusable for Cyrillic.

Of course, the limiting factor in SMS is architectural, whereas in Twitter,
it's philosophical, so there's your difference.

~~~
Dylan16807
There are only so many ways you can slice a kilobyte. Measuring bits is just
the way to do it for data transmissions.

Support for compact representation of more languages would have been nice.
Something like an 8 bit mode with 2 byte escapes for setting code page and 3
byte escapes for arbitrary unicode characters. As a bonus you'd actually be
able to fit _more_ emoji in that mode, 46 vs. 35.
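Back-of-the-envelope version of that emoji count (assuming 4 bytes per non-BMP emoji in UTF-16, versus a hypothetical 3-byte escape each):

```python
SMS_OCTETS = 140  # one SMS payload

# UTF-16: emoji outside the BMP take a surrogate pair = 4 bytes each.
utf16_emoji = SMS_OCTETS // 4

# Hypothetical 8-bit mode: each arbitrary Unicode character costs a
# 3-byte escape.
escaped_emoji = SMS_OCTETS // 3

print(utf16_emoji, escaped_emoji)  # 35 46
```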

------
spiznnx
I was going to complain about lack of benchmarking, but then I saw the "Why"
section - I wonder what else could be shared using a single tweet.

~~~
TazeTSchnitzel
base65536 can fit at most 280 bytes in a tweet.

Most tweets I make are pure ASCII, which is 7 bit.

(280 * 8) / 7 = 320 septets

Well, 160 septets is the non-Unicode maximum length of an SMS.

So a tweet can fit two SMS messages in it!
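The arithmetic, spelled out:

```python
TWEET_BYTES = 280    # base65536's maximum payload per tweet
SMS_SEPTETS = 160    # non-Unicode SMS maximum, in 7-bit characters

septets = TWEET_BYTES * 8 // 7
print(septets)                 # 320
print(septets // SMS_SEPTETS)  # 2 SMS messages per tweet
```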

~~~
mikeash
You could do slightly better by not restricting yourself to the BMP. Full
Unicode gives you almost 21 bits per code point, which is what Twitter counts.
Round it down to 20 bits per "character" and you have 2800 bits = 350 bytes in
a tweet, or 400 ASCII characters.
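Spelled out (the 20-bit figure is a deliberate round-down, not an exact limit):

```python
import math

# Unicode code points run 0 .. 0x10FFFF: just under 21 bits each.
bits_per_code_point = math.log2(0x110000)  # ~20.09

usable = 20                # round down to a clean 20 bits per "character"
tweet_bits = 140 * usable  # 2800 bits per tweet

print(tweet_bits // 8)     # 350 bytes
print(tweet_bits // 7)     # 400 seven-bit ASCII characters
```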

~~~
desdiv
Unfortunately that won't work. Twitter doesn't support full Unicode; they
strip out any Unicode code points that aren't used by "normal" people.

~~~
TazeTSchnitzel
> they strip out any Unicode code points that's not used by "normal" people.

Such as?

~~~
goodplay
LTR and RTL marks (all of them), which are used when the default Unicode bidi
algorithm fails to display two intertwined scripts correctly. They are
especially useful when writing short technical tweets.

I bet normal people would use them too if they knew about them and were given
a chance, but normal people can't because twitter doesn't allow codepoints
that aren't used by normal people.

~~~
TazeTSchnitzel
Stripping out LTR and RTL override might be reasonable because they're
frequently abused to screw up website layouts. But stripping _all_ of the
marks is just discrimination against people who need to write bi-directional
text.

~~~
cbd1984
> Stripping out LTR and RTL override might be reasonable because they're
> frequently abused to screw up website layouts.

As long as Twitter allows Zalgo it isn't that concerned about website layouts.

(Zalgo: Rampant abuse of stacking combining diacritical marks above and below
the line. It can get really, really... uh... _impressive_ , in a Lovecraftian
typewriter art kind of way.)

------
TazeTSchnitzel
HATETRIS is pure evil. Try it.

~~~
haddr
i got to 4 lines... incredible!

------
Dylan16807
Base65536 that isn't actually 65536? Odd.

Base32k is a similar idea, but it stores exactly 15 bits per codepoint. It
uses three ranges of Unicode, going for simplicity.
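A minimal sketch of the 15-bits-per-code-point idea (the code point range here is made up for illustration; real Base32k uses different ranges):

```python
BASE = 0x5000  # hypothetical start of a run of 2**15 printable code points

def encode15(data: bytes) -> str:
    """Pack bytes into 15-bit chunks, one code point per chunk."""
    n = int.from_bytes(data, "big")
    nbits = len(data) * 8
    chunks = (nbits + 14) // 15
    n <<= chunks * 15 - nbits  # left-align the bits into the chunks
    return "".join(chr(BASE + ((n >> (15 * (chunks - 1 - i))) & 0x7FFF))
                   for i in range(chunks))

def decode15(s: str, nbytes: int) -> bytes:
    """Unpack 15 bits per code point back into the original bytes."""
    n = 0
    for ch in s:
        n = (n << 15) | (ord(ch) - BASE)
    n >>= len(s) * 15 - nbytes * 8  # drop the padding bits
    return n.to_bytes(nbytes, "big")
```

Decoding needs the original byte length, since the last code point can carry padding bits.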

Edit: Ah, I see, it uses 65536 + 256 codepoints across two planes for
different numbers of bytes. Which means it is simple in its own way, but it's
also wasting a significant amount of its code space.

~~~
amptorn
> Base65536 that isn't actually 65536? Odd.

Well, Base64 isn't actually 64 either.

------
vorg
Why not use Chinese characters (kanji) to encode the moves with
Chinese/Japanese semantics, instead of Base65536? It might not be as compact,
but it would still be far more compact than ASCII _AND_ you'd be able to read
it without decoding software if you learn how to read a little Chinese or
Japanese.

~~~
thaumasiotes
It is significantly easier to write a format decoder than to learn to read,
much less write, Chinese or Japanese.

Also, on a pedantic note, kanji are Japanese characters. Chinese characters
are hanzi.

For most characters, they're the same, but there are a few differences.

~~~
vorg
Fair point.

As for the pedantic note, _kanji_ is an English word whereas _hanzi_ isn't, so
that's why I used it.

~~~
thaumasiotes
Eh... I think most people who recognize the word "kanji" at all are pretty
well aware that it's a foreign word, and are likely to think of it that way. I
do have a dictionary with an entry for "kanji", but that dictionary includes
plenty of other foreign words too, like "haole" [Hawaiian term for non-
Hawaiians, particularly (often pejorative) white people] (it doesn't have
"hanzi", but it does have an entry for Han). I also have dictionaries that
don't have an entry for "kanji".

------
javajosh
Suggest: change name to Base64k.

~~~
jonesetc
I really like this, but I'm not sure that the joke is obvious enough. Also,
Base64Ki for more precision.

~~~
javajosh
Well, maybe it would be less jokey if you called it BaseUTF8 and wrote it to
not use code points that require more than 8 bits. :)

------
DonHopkins
Is ROT32768 Unicode's answer to ROT13?

~~~
ygra
It would surely solve that pesky problem that double-ROT13 is so insecure.
(Hint: Unicode is larger than the BMP.)
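For the curious, a toy BMP-only version (self-inverse, like ROT13; over all of Unicode's 0x110000 code points, rotating by 0x8000 twice would _not_ get you back, which is the joke):

```python
def rot32768(s: str) -> str:
    """Rotate each BMP code point by 0x8000, mod 0x10000.

    Code points outside the BMP are left alone in this naive sketch.
    """
    return "".join(chr((ord(c) + 0x8000) % 0x10000)
                   if ord(c) < 0x10000 else c
                   for c in s)
```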

------
lyinsteve
Why did I just implement this in Swift?
[https://github.com/harlanhaskins/Base65536.swift](https://github.com/harlanhaskins/Base65536.swift)

------
2bitencryption
This reminds me of "usenet binaries."

The very existence of binaries on usenet still makes me chuckle. It's such a
strange medium for binaries, and yet it was so prolific.

~~~
inopinatus
NNTP is a general-purpose, globally meshed & distributed, channelized,
eventually-consistent pub/sub bus. With optional ordering, GC, and replay
semantics, and robust and stable implementations of both client and server
protocol available in a variety of language bindings.

So the only real surprise is that it isn't everything's preferred loose-
coupling substrate, but we do love reinventing the wheel.

------
byuu
Is there no end to the horrors spawned through Twitter's continued arbitrary
limitation of 140 code points per tweet? :P

------
fooyc
Reminds me of this:
[https://www.flickr.com/photos/quasimondo/3518306770/in/set-7...](https://www.flickr.com/photos/quasimondo/3518306770/in/set-72057594062596732/)

This is an attempt at encoding images in a tweet, by using Unicode.

------
AndyKelley
When would you ever use this? If you ever transmit Unicode data, you have to
encode it first. So one would probably encode this as UTF-8, which is binary.
How about skipping the middleman and transmitting the binary directly?

~~~
colanderman
This makes sense only if (a) your transport medium will only transfer
graphical Unicode code points, and (b) you are interested in minimizing _code
point_ count as opposed to byte count.

If (a) is not true, you are better off just using binary as you state.

If (b) is not true, you are better off using Base64 if the underlying encoding
is ASCII-compatible (e.g. ASCII, ISO-8859, or UTF-8), or Base4096 if the
underlying encoding is UTF-16.

In the case of tweets, both (a) and (b) _are_ true, making this somewhat
worthwhile.

------
SixSigma
Turn into image

Use pic.Twitter.com

Get 5 MB of data storage

------
jtwebman
I'll stick to base64 or even better base85 (Ascii85).

~~~
Dylan16807
I've used base85 twice. Once internal to a database that did stupid things
with binary data; that went fine. Once for exporting data; that ended up
breaking at random when formatters saw the tempting asterisks inside. I'd
prefer base64 for most cases.

