Base65536 – Unicode's answer to Base64 (npmjs.com)
137 points by tomjakubowski on Dec 6, 2015 | 50 comments



This only makes sense if your metric is number of code points. You don't actually get better density with this.

Base64, encoded as either ASCII or UTF-8, stores 6 bits per byte, making it 75% efficient.

Base65536 encoded as UTF-8 (by far the most common) will split its characters about evenly between 3-byte and 4-byte encodings (it uses code points up to 165376 roughly evenly distributed [1]). That is 16 bits per 3.5 bytes, or 57% efficient. Much worse.

UTF-16 (what JavaScript uses) is slightly better; you get a distribution about even between 2 and 4 bytes. That is 16 bits per 3 bytes, or 67% efficient. Better, but not as good as Base64 in ASCII or UTF-8. (However, it is better than Base64 in UTF-16, which is only 37.5% efficient.)

Stick with Base64 unless you're strictly counting code points (like tweets do).
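
A quick sanity check of these ratios (a minimal Python sketch; it doesn't use the real base65536 alphabet, it just models the even BMP/astral split described above):

    import base64, os

    payload = os.urandom(1000)                      # 1000 bytes = 8000 bits of input

    # Base64: one output byte per character in ASCII/UTF-8, two in UTF-16
    b64 = base64.b64encode(payload)
    print(len(payload) / len(b64))                  # ~0.75
    print(len(payload) / (2 * len(b64)))            # ~0.375

    # Base65536-style output: one code point per 2 input bytes, modelled as
    # half BMP characters (3 bytes in UTF-8, 2 in UTF-16) and half astral
    # characters (4 bytes in both encodings)
    n_points = len(payload) // 2
    fake = ''.join(chr(0x4E00 + i) if i % 2 == 0 else chr(0x20000 + i)
                   for i in range(n_points))
    print(len(payload) / len(fake.encode('utf-8')))      # ~0.57
    print(len(payload) / len(fake.encode('utf-16-le')))  # ~0.67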

[1] https://github.com/ferno/base65536/blob/master/get-block-sta...


"This only makes sense if your metric is number of code points."

That is precisely the author's metric. In the "why?" section they write: "I wanted people to be able to share HATETRIS replays via Twitter. Twitter supports tweets of up to 140 characters. "


Yes, I read the article and mentioned its use case of tweets in the last sentence of my post.


> This only makes sense if your metric is number of code points.

Not quite! :) I wrote a similar tool I called baseunicode[1]. (Glad I'm not the only one who's discovered this!) The author here is using it for tweets (which, oddly, count code points rather than bytes); I wrote my version for copy/pasting data between terminals[2]: copying more than a screen's worth is a hassle, so "screen area" is the metric I'm trying to optimize, which ends up being subtly different from the article's metric of code points per tweet. (For screen area in a terminal, some characters, CJK in particular, take multiple columns, and combining code points (like an acute accent) are basically free!)
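
For the curious, that "screen area" metric can be approximated with Python's unicodedata (a rough sketch; real terminals differ on details like ambiguous-width characters):

    import unicodedata

    def display_columns(s: str) -> int:
        """Approximate number of terminal columns a string occupies."""
        cols = 0
        for ch in s:
            if unicodedata.combining(ch):      # combining marks take zero columns
                continue
            # wide and fullwidth characters (CJK etc.) take two columns
            cols += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        return cols

    print(display_columns('abc'))        # 3
    print(display_columns('漢字'))       # 4: two columns per CJK character
    print(display_columns('e\u0301'))    # 1: 'e' plus a combining acute accent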

In my case, it ended up being easier to use base64, because it's installed by default.

[2]: SCP is just a pain sometimes. If you need to read a file readable only by root, you're out of luck, AFAIK. I'm usually going server-to-server, so I also need to remember to -3, which I inevitably forget. You can get around the root issue with `ssh source 'sudo cat /foo/bar' | ssh destination 'cat >foo'` (or `sudo bash -c "cat >foo"` if you need root on the write…), but that's more work; often I find I'm already there in both terminals and I just want these bytes from this split in that other split, so I started doing `tar -cz <stuff> | base64`, copy, `base64 -d | tar -xz`, paste. And then screen size mattered, thus baseunicode.

[1]: https://github.com/thanatos/baseunicode


> which oddly don't count bytes, they count code points

Not odd at all, actually very sensible! I'm always saddened by this oversight in SMS messages: they're 160 7-bit characters, so 140 octets. But those 7-bit characters come from the GSM charset. If you want to write in languages other than Western European ones, your message will be encoded as UTF-16, and now you only have 70 characters. That might be OK for Chinese, but it's absolutely unusable for Cyrillic.

Of course, the limiting factor in SMS is architectural, whereas in Twitter, it's philosophical, so there's your difference.


There's only so many ways you can slice a kilobyte. Measuring bits is just the way to do it for data transmissions.

Support for compact representation of more languages would have been nice. Something like an 8-bit mode with two-byte escapes for setting the code page and three-byte escapes for arbitrary Unicode characters. As a bonus you'd actually be able to fit more emoji in that mode: 46 (140 octets at 3 bytes per escaped emoji) vs. 35 (at 4 UTF-16 bytes each).


If screen area is your metric, and combining code points are free, I bet you could use [1] to add an arbitrary amount of data to a single character.

[1] http://www.eeemo.net/


I do this too. xclip is your friend.


You could, however, use the code points in a weighted manner and get better efficiency.


Only if you are forced to count your "code points" as UTF-8 bytes, maybe due to some dependency. But it is usually the opposite: dependencies tend to limit you to ASCII, if anything.


I was going to complain about lack of benchmarking, but then I saw the "Why" section - I wonder what else could be shared using a single tweet.


256-byte demos come to mind:

http://www.pouet.net/prodlist.php?type%5B%5D=256b

A little effect in 256 bytes of binary code. Some of them are amazing:

http://www.pouet.net/prod.php?which=66380


base65536 can fit at most 280 bytes in a tweet.

Most tweets I make are pure ASCII, which is 7 bit.

(280 * 8) / 7 = 320 septets

Well, 160 septets is the non-Unicode maximum length of an SMS.

So a tweet can fit two SMS messages in it!


You could do slightly better by not restricting yourself to the BMP. Full Unicode gives you almost 21 bits per code point, which is what Twitter counts. Round it down to 20 bits per "character" and you have 2800 bits = 350 bytes in a tweet, or 400 ASCII characters.
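
Spelling out the arithmetic in these last two comments (assuming Twitter's 140-code-point limit throughout):

    CODE_POINTS = 140

    print(CODE_POINTS * 16 // 8)   # 280 bytes per tweet at 16 bits/code point (base65536)
    print(CODE_POINTS * 16 // 7)   # 320 septets, i.e. two 160-septet SMS messages
    print(CODE_POINTS * 20 // 8)   # 350 bytes at ~20 bits/code point (full Unicode, rounded down)
    print(CODE_POINTS * 20 // 7)   # 400 seven-bit ASCII characters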


This is dangerous though, because sending a Unicode string through Twitter subjects it to NFC normalization, which can alter its structure. If you want guarantees that this won't happen, your choices of code point are far more constrained, and the unassigned code points, which are most of them, can't be used at all. I think this is the purpose of base65536gen.

https://www.npmjs.com/package/base65536gen
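
A quick Python illustration of what NFC normalization can do to a code point sequence (the same thing happens to any encoded payload built from code points that don't survive normalization):

    import unicodedata

    s = 'e\u0301'                        # 'e' + combining acute accent: 2 code points
    t = unicodedata.normalize('NFC', s)  # collapses to the precomposed U+00E9: 1 code point
    print(len(s), len(t))                # 2 1
    print(s == t)                        # False: the payload has been silently altered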


Unfortunately that won't work. Twitter doesn't support the full Unicode range; they strip out any code points that aren't used by "normal" people.


Are you sure? I just posted a tweet in Cuneiform, which is outside the BMP. So their view of "normal people" seems to encompass ancient Sumerians, at least.


Interesting, is there a list of what they strip out? I know they apply normalization so that would eliminate a few, but I think that wouldn't do too much on its own.


> they strip out any Unicode code points that's not used by "normal" people.

Such as?


LTR and RTL marks (all of them), which are used when the default Unicode bidi algorithm fails to display two intertwined scripts correctly. They are especially useful when writing short technical tweets.

I bet normal people would use them too if they knew about them and were given a chance, but normal people can't, because Twitter doesn't allow code points that aren't used by normal people.


Stripping out the LTR and RTL overrides might be reasonable, because they're frequently abused to screw up website layouts. But stripping all of the marks is just discrimination against people who need to write bidirectional text.


> Stripping out LTR and RTL override might be reasonable because they're frequently abused to screw up website layouts.

As long as Twitter allows Zalgo it isn't that concerned about website layouts.

(Zalgo: Rampant abuse of stacking combining diacritical marks above and below the line. It can get really, really... uh... impressive, in a Lovecraftian typewriter art kind of way.)


The issue of affecting web layout can be resolved by simply appending the appropriate resetting code point to the tweet (U+202C, POP DIRECTIONAL FORMATTING, for the bidi overrides).


Well, null, to start with. Other ASCII control characters. Consecutive line breaks, too.


HATETRIS is pure evil. Try it.


I got to 4 lines... incredible!


Base65536 that isn't actually 65536? Odd.

Base32k is a similar idea, but it stores exactly 15 bits per code point. It uses three ranges of Unicode, going for simplicity.

Edit: Ah, I see, it uses 65536 + 256 code points across two planes for different numbers of bytes, which means it is simple in its own way, but it's also wasting a significant amount of its code space.
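
Roughly, the shape of that scheme (a toy sketch with made-up block positions, NOT the real base65536 alphabet, which scatters its 257 blocks of 256 across normalization-safe ranges): every pair of input bytes becomes one code point from a 65536-point block, and a trailing lone byte, if any, comes from a separate 256-point block.

    PAIR_BASE = 0x20000   # hypothetical block for two-byte values
    LAST_BASE = 0x4E00    # hypothetical block for a final lone byte

    def toy_encode(data: bytes) -> str:
        out = []
        for i in range(0, len(data) - 1, 2):
            out.append(chr(PAIR_BASE + data[i] * 256 + data[i + 1]))
        if len(data) % 2:
            out.append(chr(LAST_BASE + data[-1]))
        return ''.join(out)

    def toy_decode(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if LAST_BASE <= cp < LAST_BASE + 256:
                out.append(cp - LAST_BASE)
            else:
                value = cp - PAIR_BASE
                out += bytes([value // 256, value % 256])
        return bytes(out)

    assert toy_decode(toy_encode(b'hello')) == b'hello'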


> Base65536 that isn't actually 65536? Odd.

Well, Base64 isn't actually 64 either.


Why not use Chinese characters (kanji) to encode the moves with Chinese/Japanese semantics, instead of Base65536? It might not be as compact, but it would still be far more compact than ASCII, AND you'd be able to read it without decoding software if you learn how to read a little Chinese or Japanese.


Because you couldn't find a meaningful way to assign Chinese characters to represent each possible combination of two of {left, right, down, rotate}, which is what the previously mentioned hex encoding can do, let alone do better. (As a novice, I can think of relatively reasonable ways to represent some of them, but no way to account for order. For example, "down left" could be "southwest" via the Zodiac animal, "left right" could be "horizontal", and "down rotate" could be "screw"... but that wouldn't help with "left down", "right left", etc.)


It is significantly easier to write a format decoder than to learn to read, much less write, Chinese or Japanese.

Also, on a pedantic note, kanji are Japanese characters. Chinese characters are hanzi.

For most characters, they're the same, but there are a few differences.


Fair point.

As for the pedantic note, kanji is an English word whereas hanzi isn't, so that's why I used it.


eh... I think most people who recognize the word "kanji" at all are pretty well aware that it's a foreign word, and are likely to think of it that way. I do have a dictionary with an entry for "kanji", but that dictionary includes plenty of other foreign words too, like "haole" [Hawaiian term for non-Hawaiians, particularly (often pejorative) white people] (it doesn't have "hanzi", but it does have an entry for Han). I also have dictionaries that don't have an entry for "kanji".


Suggest: change name to Base64k.


I really like this, but I'm not sure that the joke is obvious enough. Also, Base64Ki for more precision.


Well, maybe it would be less jokey if you called it BaseUTF8 and wrote it to not use the code points that require more than 8 bits. :)


Why did I just implement this in Swift? https://github.com/harlanhaskins/Base65536.swift


Is ROT32768 Unicode's answer to ROT13?


It would surely solve that pesky problem that double-ROT13 is so insecure. (Hint: Unicode is larger than the BMP.)


This reminds me of "usenet binaries."

The very existence of binaries on usenet still makes me chuckle. It's such a strange medium for binaries, and yet it was so prolific.


NNTP is a general-purpose, globally meshed and distributed, channelized, eventually-consistent pub/sub bus, with optional ordering, GC, and replay semantics, and robust, stable implementations of both the client and server protocol available in a variety of language bindings.

So the only real surprise is that it isn't everything's preferred loose-coupling substrate, but we do love reinventing the wheel.


It's not that strange if you think of things like email attachments... or do you also consider that to be "a strange medium for binaries"?


Is there no end to the horrors spawned through Twitter's continued arbitrary limitation of 140 code points per tweet? :P


Reminds me of this: https://www.flickr.com/photos/quasimondo/3518306770/in/set-7...

This is an attempt at encoding images in a tweet, by using Unicode.


When would you ever use this? If you ever transmit Unicode data, you have to encode it first, so one would probably encode this as UTF-8, which is binary. How about skipping the middleman and transmitting the binary directly?


This makes sense only if (a) your transport medium will only transfer graphical Unicode code points, and (b) you are interested in minimizing code point count as opposed to byte count.

If (a) is not true, you are better off just using binary as you state.

If (b) is not true, you are better off using Base64 if the underlying encoding is ASCII-compatible (e.g. ASCII, ISO-8859, or UTF-8), or Base4096 if the underlying encoding is UTF-16 (4096 code points fit comfortably in single 2-byte UTF-16 code units, so you get 12 bits per 2 bytes, the same 75% efficiency Base64 achieves over ASCII).

In the case of tweets, both (a) and (b) are true, making this somewhat worthwhile.


It's explained in the linked post - people wanted to share HATETRIS replays through Twitter, and this helps that happen.


Turn into image

Use pic.Twitter.com

Get 5Mb of data storage


I'll stick to base64 or even better base85 (Ascii85).


I've used base85 twice. Once internal to a database that did stupid things with binary data; that went fine. Once for exporting data; that ended up breaking at random when formatters saw the tempting asterisks inside. I'd prefer base64 for most cases.



