
Base 65536 - leoh
https://github.com/qntm/base65536
======
tomxor
Similarly we do a lot of unicode bit packing on Dwitter.net since the code
size limit is in characters (140) not bytes.

There are a few compressors allowing an increase of arbitrary ASCII chars,
with the most allowing 194 chars including decoder, the compressors are
themselves dweets! :
[https://www.dwitter.net/d/11852](https://www.dwitter.net/d/11852) This is not
the most efficient way to store bits in unicode but there is a balance with
dwitter because you need to fit the decoder in the same 140 characters.

There are also some more specialist methods such as packing image or other
binary data and decode using string.charCodeAt() e.g
[https://www.dwitter.net/d/12808](https://www.dwitter.net/d/12808) this is not
limited to exact bit word sizes, for instance I packed this 3 color image
efficiently by encoding it numerically in base3, ~1.6 bits per pixel, the
decoding just requires division by power's of 3 and mod3 (generalization of
bitwise shifting and masking):
[https://www.dwitter.net/d/19166](https://www.dwitter.net/d/19166)

[edit] I should explain, that last example combines the general code
compressor with the image bit packing technique, the image it encoded as 5
pixels (3 color) per 8bit characters but doesn't use the higher bits of
unicode to allow the whole code to be encoded using the above compressor which
fits two chars up to 8bit into UTF-16 and wraps it in the decoder.

You can see the source by replacing "eval" with "throw".

~~~
srtjstjsj
How about adding standard decompressers to the Dwitter JS environment so they
can be executed with less code?

~~~
tomxor
Do you mean to add the decompressor wrapper automatically so that more UTF-16
packed code can fit?

That would effectively increase the limit to 280 usable ASCII chars, i.e a
double dweet. This has been discussed before but I don't think the authors
will implement it.

There are a few other JS golfing sites that have multiple size brackets and I
believe they are less active as a result. Although dwitter is not exactly
popular I think this clear limitation is what keeps it focused and the
community actively engaged. The 194B hack is a slight weakening of that, but
it's often reflected by votes as people are less impressed when you've
resorted to using that trick unless it was really worth it... it also feels
like a fair compromise because you cannot for instance use emojis inside that
compression method so it's not entirely free.

------
bazzargh
For those looking for motivation - you can see examples of the related
base2048 encoding used in tweets to the bbc micro bot twitter account:
[https://www.dompajak.com/bbcmicrobot.html](https://www.dompajak.com/bbcmicrobot.html)
...encoding and decoding (to learn from other's coding tricks) is easy with
the editor:
[https://editor.8bitkick.cc/?compress=true](https://editor.8bitkick.cc/?compress=true)

without these tools I'd never have managed to fit this in a tweet:
[https://twitter.com/bbcmicrobot/status/1248312174054948864](https://twitter.com/bbcmicrobot/status/1248312174054948864)

~~~
NelsonMinar
what a thing of beauty

------
ithkuil
Initially designed to pack more data in a tweet. Thanks to this article I
learned about:

[https://blog.twitter.com/engineering/en_us/topics/insights/2...](https://blog.twitter.com/engineering/en_us/topics/insights/2017/Our-
Discovery-of-Cramming.html)

~~~
laurent92
So they have differing tweet length according to the UTF8 pane it’s using. But
Germans are still cheated compared to English, which is a quite concise
language. (of course, the result focuses on Japanese because of the huge
difference to latin languages)

------
jwilk
Discussed in 2017:

[https://news.ycombinator.com/item?id=14468818](https://news.ycombinator.com/item?id=14468818)
(122 comments)

In 2015:

[https://news.ycombinator.com/item?id=10683889](https://news.ycombinator.com/item?id=10683889)
(50 comments)

------
userbinator
_Base65536 uses only "safe" Unicode code points - no unassigned code points,
no whitespace, no control characters, etc.._

 _The string encodes two bytes per code point._

Is it really 64k then? It is hard to understand how this works without
documentation on the format.

I wonder how many others were able to mentally decode the example input string
just on sight, from having worked with ASCII extensively.

~~~
Sniffnoy
It encodes two bytes per code point; the encoding of that code point may be
longer than two bytes. (Indeed it mentions that it's optimized for UTF-32.) It
accomplishes having the full 2^16, while only using safe characters, by not
restricting to the BMP.

If you read the details of how it was constructed, you can see that the reason
the 2^17 version isn't ready yet is because there aren't yet 2^17 safe
characters in Unicode! (There are a total of 123,813 characters in Unicode 13
that are judged "safe" by qntm's criteria.)

------
octoberfranklin
[https://github.com/qntm/base65536#why](https://github.com/qntm/base65536#why)

> Why?

> Erm.

> I wanted people to be able to share HATETRIS replays via Twitter.

My brain is melting.

~~~
yebyen
So I went ahead and allowed myself to ask, "why is it called HATETRIS" and
tried it out without reading anything first.

This has been a wonderful rollercoaster, thank you for sharing and pointing
this out :D I only played the game for 30 seconds before I could already guess
exactly why it's called that!

~~~
lisper
HATETRIS is actually a surprisingly engaging challenge, particularly since it
has no time constraints. After a few minutes I found I enjoyed it at least as
much if not more than regular Tetris. Figuring out how to make all the pieces
appear makes quite the juicy little puzzle, as does figuring out how to
maximize the number of cleared lines.

~~~
yebyen
I was not able to clear more than 2 lines per game. I would like to become
good at this game. I'm not sure it's possible. But the prospect of a community
that shares hatetris replays is intriguing...

------
Slartie
I love this! Idea, execution, everything.

This reminds me of the times when I encoded a binary database of several
megabytes, with indices and everything, in the 53 integer bits of huge 64-bit
floating point number arrays, because that was the only way to use that data
in a Lua runtime environment of a computer games' UI that didn't expose direct
file system access to UI plugins.

------
toastal
I want to marvel at the characters and languages our species has created and
we've also found a way to encode and display on computers.

------
kps
> _maximum Tweet length is indeed 280 Unicode code points, except that code
> points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double._

Oh, _that 's_ why some carefully composed tweets don't fit — things like
General Punctuation and Currency Symbols count double. I think I'm up to the
Gillette5 version of Hanlon's Razor where Twitter is concerned.

------
lifthrasiir
Superseded by Base2048 [1] by now.

[1] [https://github.com/qntm/base2048](https://github.com/qntm/base2048)

~~~
thih9
Assuming the goal is to transmit data via Twitter.

Other environments may treat characters after U+1100 with no penalties.

See: “Rationale” section at:
[https://github.com/qntm/base2048#rationale](https://github.com/qntm/base2048#rationale)

> following some experimentation and examination of the new web client code,
> we now understand that maximum Tweet length is indeed 280 Unicode code
> points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now
> count double.

~~~
lifthrasiir
You are right that Base2048 is tailor-made for Twitter, but I think so is
Base65536.

More specifically, the explicit length limitation is easily one of the
weirdest aspects of Twitter. It makes Twitter experience somewhat like that of
IRC, faciliating knee-jerk reactions and misunderstandings. And the limit is
severe even for casual conversation to the point that it had to be changed
once emojis broke its core assumptions. I don't think Base65536 has much use
outside of Twitter and likes.

~~~
Dylan16807
Base65536 is useful anywhere people are pasting codes to each other with sizes
around 20-1000 bytes. Taking up half as much screen space as base64 is a big
feature.

------
ignoramous
I needed something similar for a javascript project and stumbled upon this
excellent library by pieroxy that worked flawlessly:
[https://pieroxy.net/blog/pages/lz-
string/index.html](https://pieroxy.net/blog/pages/lz-string/index.html)

base65536 looks pretty similar to that except that base65536 is not as much
focused on compression than it is on "copy-paste-ability".

Shoco is another excellent library that comes to mind for shorter strings:
[https://ed-von-schleck.github.io/shoco/](https://ed-von-
schleck.github.io/shoco/)

------
outsomnia
Mightn't the raw data be a good candidate for huffman encoding

[https://en.wikipedia.org/wiki/Huffman_coding](https://en.wikipedia.org/wiki/Huffman_coding)

~~~
lifthrasiir
Base65536 is not a compression algorithm but rather an encoding scheme that
can convey any data compressed or not. Compression algorithms have an implicit
input of the list of output symbols, which is commonly but not necessarily an
octet (0 through 255). If your output format is the tweet you will eventually
end up with something like Base65536 for output symbols.

~~~
GuB-42
The article mentions compression briefly. They are using basic run-length
encoding.

There are probably better algorithms. GP suggested Huffman but I don't think
it will work. Something like the algorithm used by crinkler [1] (context
modeling + arithmetic coding) may work better. It can be made even better if
combined with the encoding (the base doesn't have to be a power of two).

Anyways, no need to overdo it, the author's technique works. At least for the
newer base2048 version, optimized for long tweets

[1] [https://xoofx.com/blog/2010/12/28/crinkler-
secrets-4k-intro-...](https://xoofx.com/blog/2010/12/28/crinkler-
secrets-4k-intro-executable/)

------
iflp
Make sure to check out the game. Best tetris ever.

[https://qntm.org/files/hatetris/hatetris.html](https://qntm.org/files/hatetris/hatetris.html)

------
frou_dh
On the other end of the scale, we have base-1, e.g. scratching a row of 1's on
a surface as a tally.

I guess base-0 can't really exist, since 0⁰ is undefined.

~~~
castratikron
Fun fact: As the base goes down, the number gets longer. If you think about
the base as the height, you could then find the "area" of the number by
multiplying the two. It turns out that the base that minimizes this area is e.

[https://web.williams.edu/Mathematics/sjmiller/public_html/10...](https://web.williams.edu/Mathematics/sjmiller/public_html/105Sp10/addcomments/Hayes_ThirdBase.htm)

------
samcheng
Javascript really has come a long way.

I can remember when hacks like this were in CPAN, and then in pip. Now npm?!

------
praptak
_" To encode one additional bit per code point, we need to double the number
of code points we use from 65,536 to 131,072."_

I'd argue that going into fractional bits per code point would not make the
project significantly geekier than it already is :-)

~~~
lifthrasiir
Well, if you really want to put more data to the tweet you can always abuse
t.co [1].

[1] [https://xem.github.io/obfuscatweet-
reloaded/](https://xem.github.io/obfuscatweet-reloaded/)

------
Zamicol
Related: An "arbitrary base converter" that treats all strings as numbers:
[https://convert.zamicol.com/](https://convert.zamicol.com/)

------
p2t2p
Reminds me of Jira. Jira (at least server and data centre editions) uses base
64 numbers to impose global ordering onto issues.

------
captn3m0
Has anyone tried to tweet compressed data in any of these encodings.

Wondering if there any famous pieces of text that would fit.

------
spiritplumber
I like Base1. Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

------
intricatedetail
Using a platform to do something it was not designed for could constitute
hacking and therefore getting you into a legal trouble.

~~~
userbinator
A platform designed to disseminate data is being used for its purpose. This is
really no different than switching from English to Chinese to increase
information density: [https://pugs.blogs.com/audrey/2009/10/our-paroqial-
fermament...](https://pugs.blogs.com/audrey/2009/10/our-paroqial-fermament-
one-tide-on-another.html)

