Base65536 encoding (github.com/qntm)
334 points by rahiel on June 2, 2017 | 120 comments

256 byte packet and a 192 bit authentication hash, why use fast flux dns to run C&C on your botnet when you can just make them twitter followers.

EDIT: And in case that isn't clear. Imagine you have a botnet, and all of the individual members create a twitter account. All of the twitter botnet accounts follow the 'master'. Who can tweet a command (and corresponding authentication key) to the botnet to say "follow chuck and do up to n things for him, here is his public key". Now Chuck suddenly has all these followers and when the time is right he tweets out his command, "ddos my greatest enemy" and adds his 'proof'. Off they go and blast his enemy. If he was only allotted one command then they all un-follow him.

Basically it's social media for botnets.

It's funny how this is really close to when celebrities tweet about how good a brand is.

So shops inflate prices just to make it harder to ddos. Oh, shi~

Because Twitter would be able to shut down your botnet.

Conceptually I agree with you, Twitter should be able to shut down a botnet like this by simply 'identifying bogus twitter accounts' (aka twitter bots) and 'deleting those accounts en masse.'

The part where it gets weird though is that twitter already has massive botnets which run around in it retweeting things and what not. Which they do not shut down. So is that because they don't want to? Because they can't? or simply because it isn't worth their time? That is still an unresolved question for me.

I suspect it's because the Twitter bots inflate the user count of Twitter. It seems that even though Twitter's monetization is not the greatest, their valuation mostly comes from the number of users on Twitter. The reasoning process seems to be something like, "Twitter doesn't make a lot of money right now, but they have the ear of some 300 million users, and that's going to be valuable in the future when they get their monetization story correct, right?" And so TWTR prices are kept afloat. Bots probably degrade the user experience of Twitter quite a bit, but if Twitter aggressively bans all bots, their user count goes down maybe 15%[1], and what does that do to their stock price?

[1] https://arxiv.org/pdf/1703.03107.pdf

Good observations. However, user experience matters more in growing the company (it helps user growth and retention), rather than user numbers and short-term stock price

Yes - but who is calling the shots?

Wait, what's wrong with Twitter bots? I run one that tweets questions made on a QA platform, and people seem to find it useful.

Some are shitty, but that seems considerably harder to detect.

That's why you disguise the CNC commands as legitimate social media noise.

Then you can double dip - sell what the content of the command/abort messages are (or just allow for 'padding' messages for advertising in between) and suddenly your botnet doubles as a completely legitimate social media firm

Super bonus round: Launder your money from botnet activities through the social media firm's books

(I am, of course, kidding)

That wasn't just an idea. People have used Twitter for exactly this purpose for a long time, because it skews user numbers. I'd also assume that overhead is very low for c&c botnets, since it's ultimately just a little pubsub message queue, so there's even less incentive to shut them down. I get at LEAST three followers a day with "buy RTs, followers" in their profile on my personal Twitter.

Spam botnets ain't nothing new. They'll always be around.

I know you're doing a thought experiment here (and a rad one at that), but wouldn't this leave a very obvious trail of activity though?

Likely. A "non-evil" use would be setting up "IP over twitter"

Oh man, this is a wicked idea for a weekend hack.

This happened within the first year of twttr.

Yeah but what you really want is base-emoji. https://github.com/pfrazee/base-emoji

Sometimes you don't know what you're missing in life until you find it.

    Before you came into my life
    I missed you so bad
- Carly Rae

I came up with the same concept - and almost the same name[1] - as a joke at work last year, and could not stop laughing at the results, so I completely agree.

The one improvement to the public project I'd suggest is from my own. Instead of always starting from index 0, for each character position, increment the starting index by one until it loops around. That way all of the emojis get a roughly equal chance of screen time - depending on the entropy of the unencoded data, of course.

[1] I called mine "basemoji".
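The rotating-start-index idea above can be sketched in a few lines. This is only an illustration, not the code from base-emoji or basemoji: the 256-symbol alphabet is a hypothetical placeholder built from consecutive code points starting at U+1F300, and the names `ALPHABET`, `encode`, and `decode` are made up for the example.

```python
# Hypothetical 256-symbol alphabet: consecutive code points from U+1F300.
# (base-emoji reads its real alphabet from emojis.json; this is a stand-in.)
ALPHABET = [chr(0x1F300 + i) for i in range(256)]
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def encode(data: bytes) -> str:
    """Map each byte to a symbol, shifting the start index by one per position."""
    return "".join(ALPHABET[(byte + pos) % 256] for pos, byte in enumerate(data))


def decode(text: str) -> bytes:
    """Undo the per-position shift to recover the original bytes."""
    return bytes((INDEX[symbol] - pos) % 256 for pos, symbol in enumerate(text))


if __name__ == "__main__":
    zeros = bytes(8)
    # Without rotation, eight zero bytes would repeat one symbol eight times;
    # with rotation each position gets a different symbol.
    print(len(set(encode(zeros))))
    print(decode(encode(b"hello")) == b"hello")
```

The shift costs nothing in size, since the decoder can recompute the offset from the position alone.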

Oh, wow. But... the real question is how many bytes of data can I put in a tweet using this? :-)

It's right below the table:

> For example, using Base64, up to 105 bytes of binary data can fit in a Tweet. With Base65536, 280 bytes are possible.

He's asking about base-emoji, not Base65536.

>The emojis used are in emojis.json. There are 843 emojis there, but the converter reads sequences of 8 bits at a time, and so only maps the value to the first 256 of them.

One byte per emoji means 140 bytes per tweet. Since it's just a joke format they're not trying to be space efficient.

Using all 843 emojis would result in 170 bytes per tweet (ln 843 / ln 2 / 8 * 140).
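The figures quoted above (105, 140, 170 and 280 bytes) all come from the same formula: bits per character times 140 characters, divided by 8. A small sketch, assuming the 140-character limit discussed in this thread; the function name `bytes_per_tweet` is just for illustration.

```python
import math


def bytes_per_tweet(alphabet_size: int, chars: int = 140) -> float:
    """Upper bound on payload bytes when each character is one of alphabet_size symbols."""
    return math.log2(alphabet_size) * chars / 8


if __name__ == "__main__":
    for name, size in [
        ("Base64", 64),
        ("base-emoji (first 256 emojis)", 256),
        ("all 843 emojis", 843),
        ("Base65536", 65536),
    ]:
        print(f"{name}: {bytes_per_tweet(size):.1f} bytes")
```

For alphabet sizes that are not a power of two (such as 843), this is a theoretical bound: reaching it means treating the whole message as one large number rather than mapping a fixed number of bits to each character.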

Base-one was an education. I feel like base one should have some error correcting ability to take care of slippage events.

Just include a length field with the message. The message length contains enough information to repair any possible corruption to the message.

the value of the length field must also be expressed in base-1. also, the total length reported by the 'message length' header must include the space taken by the header itself as well.

I think I am missing the joke

> the "joke"

It's "hilarious", you're missing out on "a witty insight" about "valuable programming skills".

alt post: it's as much of a joke as xkcd's "I understood that reference" performance pieces.

"Unnecessary" usage of "quotation" marks doesn't "make you" seem "enlightened", "it 'just' makes" you "sound" ""pretentious"".

The quotes are what other people have said. Jokes need to have a punchline and evoke a humour response in the audience; the references to base-1 in the ancestor comments do neither, so describing them as a 'joke' is inaccurate (hence the quotes to mark it as verbatim).

I certainly agree that overuse of quotes makes a comment pretentious!

Maybe you're not the intended audience then? Personally it got a smile out of me, and I do find some xkcd comics funny. If you don't, that's fine, but it doesn't make them not jokes.

I find that the barrier they present to understanding for people unfamiliar with "geek culture" is elitist and exclusionary. The emotional response they evoke in the viewer is based on, "I know something that other people don't", which doesn't match my experience of actual jokes from comedians.

I just went to see what you meant, and the current front-page comic on xkcd[1] doesn't convey that to me at all. Fudging statistics isn't part of geek culture (even politicians do it :P). I went back and of the past 10 comics, you could argue 3 of them reference "geek culture" so heavily you can't understand the joke unless you know what they're on about.

I'm not going to pretend that it's not true (after all, the author is a geek and wants to share that with other people -- there are plenty of comics for "non-geeks"), but it's definitely not true of _every comic_.

> The emotional response they evoke in the viewer is based on, "I know something that other people don't"

Personally, the xkcd comics I find funny are the ones I relate to (the same with any joke). In particular, this one on Machine Learning[2] was especially funny to me because someone in my research group recently started looking at neural nets to analyse stellar spectra and we had a similar conversation with a similar conclusion.

Just because very few people might relate to a particular comic doesn't mean that they relate to that comic because they like feeling superior (they might, but that's not what the comic is trying to do).

[1]: https://xkcd.com/1845/ [2]: https://xkcd.com/1838/

> the xkcd comics I find funny are the ones I relate to (the same with any joke).

lol good one

Wait, are you a chicken? Is the classic galline joke not funny to you because you can't relate to a bird?

When you talk about finding that xkcd funny because you could relate its situation to a personal experience, I feel that was evoked from the feeling of nostalgia/coincidence; that certainly can be funny, but it's definitely not a joke.

Take the "people under the orange sun/red sun" joke from My Three Suns[0] (Futurama S01E07, 1999): it's not funny just because it's a callback to the same joke in Homer and Apu[1] (The Simpsons S05E13, 1994); or because it's satirizing the derivative "white people/black people" comedy routines (e.g. Eddie Griffin[2]) inspired by the set from Richard Pryor: Live in Concert[3] (1979).

It's funny because the comedian walks in a funny way, and talks in a silly voice. Very few people I know would walk & talk like this, yet (even without being able to relate to it) I can enjoy its humour. The multi-layered references make it a deep joke, and the timing/acting/context make it a good joke.

My issue with xkcd is that it only uses the reference part, seeing that many great jokes include callbacks to other jokes, but missing the "being funny" basic requirement of a joke.

[0]: https://youtu.be/EZe7z73jKj8 [1]: https://youtu.be/L104LViQeIw [2]: https://youtu.be/o_RZusRfuw4 [3]: https://youtu.be/RL8Rru-lFmg

Again, people find things funny for different reasons. You prefer to focus on the _method_ of telling a joke (timing and so on). While I definitely enjoy such jokes, the jokes I find really funny are the ones that actually relate to my life (or contrast with it) -- or even more broadly the ones that make me think. That's just the kind of humour I'm into.

Here's some counter-examples:

* Most of Brazil is funny for many different reasons. Yes, it has references to 1984, and the style and acting are very necessary to make the jokes work. But everyone can relate (in some way) to the extremely over-blown life of the protagonist -- someone who is stuck in a system working a job they hate with endless bureaucracy. What makes it funny is how blown out-of-proportion it is and how transparent the internal inconsistencies are.

* Rick and Morty is a complicated subject to approach (it has many, many different layers of humour), but some of my favourite jokes come from cases where Rick or Morty reference things that I e with. For example, the whole "inception is so hard to understand" concept was part of a very funny quip where Rick tells Morty that "he doesn't have to impress him" when Morty says "inception wasn't hard to understand". There are many other instances of that.

Again, I don't understand your point. Why are you arguing about what's funny? I thought we concluded a long time ago that humour was subjective. Why are you telling me what should and shouldn't be funny?

For example, I don't think that clip was very funny. But that's just me.

> You prefer to focus on the _method_ of telling a joke (timing and so on)

No, I don't. That's something I mentioned that makes a joke good, but you're ignoring what I said makes a joke a joke: the silliness leading to a catastrophic collapse of a pre-conceived understanding into a different but plausible model.

You seem determined to be right about this. Again, note that I do not claim that these kinds of performances aren't funny (since that is a matter for the beholder); but that they are not jokes, and their barrier to understanding is exclusionary and elitist.

> I'm not saying they aren't funny, but they are not jokes.

You might have a different definition of joke to me. Is a joke not something that is said or done to cause amusement or laughter in the audience?

> the silliness leading to a catastrophic collapse of a pre-conceived understanding into a different but plausible model.

That sounds like a _type_ of joke, but it's quite reductionist to claim that "all jokes must be like this".

Much like claiming, "all things I find funny are jokes"

the structure of this comment train makes me very happy.

It actually could be interesting, if it only wasn't so (needlessly) inefficient.

'Inefficient to run but quick to code' is my motto.

If that includes easy to understand, YOU'RE HIRED!

Oh, it absolutely does!

...wait, you mean, like, for other people?

... you mean, for me a week after I wrote it?

See also:


Twitter characters can actually store up to nearly 31 bits each, if you’re using the JSON API. (Or at least, this was true in 2010. I don’t know whether this is still true.)



Base-122 encoding is 87.5% efficient in UTF-8, better than anything listed in the base65536 repository’s comparison table.

From the conversation in the hacker news link, it looks like base122 gets the increased efficiency from using unprintable control characters, which is incompatible with base65536's explicit goal of only using printable non-whitespace characters in its output.

I think they missed a great opportunity to call it "base64k" encoding.

I'm the one who made the C / UNIX Shell implementation - it was a fun and quick thing to make.


I'd appreciate some feedback.

I don't seem to get the efficiency table (or how efficiency is defined here?). Since Base65536 encodes 16 bits, why can't it encode UTF-16 with 100% efficiency? It says the efficiency is 64% instead.

I'm sure it's true, just curious why.

Not every possible 16 bit value is itself valid UTF-16 - some need to be represented as two UTF-16 code-units, aka "surrogate pairs", 4 bytes total.

Not all of the code points used are (or can be) in the Basic Multilingual Plane. This means that when encoded in UTF-16, they come out to 32 bits, not 16. This skews the average number of output bits per input bit upwards.
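To make the surrogate-pair effect concrete, here is a rough sketch that measures payload bits against encoded UTF-16 bits. The two sample code points are illustrative choices (one inside the Basic Multilingual Plane, one outside) and have not been checked against the actual Base65536 alphabet; `utf16_efficiency` is a made-up helper name.

```python
def utf16_efficiency(text: str, payload_bits_per_char: int = 16) -> float:
    """Payload bits divided by the bits the text occupies once encoded as UTF-16."""
    encoded_bits = 8 * len(text.encode("utf-16-le"))  # "-le" avoids adding a BOM
    return payload_bits_per_char * len(text) / encoded_bits


BMP_CHAR = chr(0x3400)      # inside the Basic Multilingual Plane: one 16-bit unit
ASTRAL_CHAR = chr(0x20000)  # outside it: a surrogate pair, two 16-bit units

if __name__ == "__main__":
    print(utf16_efficiency(BMP_CHAR))                # every character in the BMP
    print(utf16_efficiency(ASTRAL_CHAR))             # every character outside it
    print(utf16_efficiency(BMP_CHAR + ASTRAL_CHAR))  # a mix lands in between
```

An alphabet made only of BMP characters would reach 100%, one made only of supplementary characters would sit at 50%, and a mixed alphabet falls somewhere between the two, which is where a figure like 64% comes from.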

Basic Multilingual Plane always makes me think of Cthulhu... it lives beyond

At first I thought this was going to be a joke, then I thought it was going to be stupid, but it's actually brilliant.

I think it's all of those.

You could expand the encoding further if you didn't restrict yourself to a whole number of bits per character.

> These shortcomings are expected to be fixed in Base65538.

I fricking love qntm.

Ummm… 1 dollar Bob!

(anxiously excited)

I hate this game.

Managed to make 1 point at 𤄻𣺻𣼋耈𣺻興𣼫兊𠨋𢪄𡚻𡢁𢙌𢚻𠛀𣪻栌𤄋𤯄𤆻𤆠𠞠𤪇𤆻𠙀𤅴𤆧𣪤𡚻𥪹炌𤆀㶸聙𡊰𠨌𡪻𤇅𤆀薠嫊䂔𔔌𥩋㲼耈𠊁繈倘𤨸𣾔㼬𤚱𢩋𣿋𡉌膹敃ꎹ𡩋肐𠝒𠚬醸聛㰩


Four points: 邇𤆻肹㾸𣾻㰈㼈𤆴僃𣊻𤆄肗𠪠𤆄㾻𠢻𤆻𤅶綻𤅋𣺻𠨰𤆄𤦴𤄫疐𠶐𤅴肹𤆰䂸㼈䂺𤄋𤅴肺𥆐䂹𤆄栌紌𤇀𤶔𣽛𤅌畜𤇂沃𠫄㲕脋𤅵𤄳𤶄𢩛𤇄昤㲺耈𠘱膀㢹𠷄絋𣝌𠥀𠘕𤪰炳𤶐腺䁋䀄𤨄𡈋ᖠ

You're a genius.

Since when did people start to label C implementation as "Unix shell"?

Yeah, it's weird. If the PHP repo had code to launch a web server, it would be the HTTP implementation instead.

It's labeled for where you want to use it. If you want a Python library (say for use within your app), there's the Python version. If you want to use it in your shell there is a version for that (written in C).

Even if it's technically incorrect, I don't think it's unreasonable. When they're saying "a Ruby implementation", "a Node implementation", etc., they really mean an implementation callable from within those languages.

The C implementation is not a callable library, it's a binary. So in the same sense, it's a "Unix shell-accessible implementation".

Well, no. Or at least, when I say "a Ruby implementation", I mean it is implemented in Ruby, not callable from Ruby.

When I refer to a Python C extension, I refer to it as a C implementation, even though the express purpose is to be callable from Python.

I can call Perl (and Python, and Ruby, and Javascript, and...) scripts from the shell, too. So those are now "Unix shell" as well?

I don't follow where the shell enters this conversation at all.

How exactly do you expect to invoke the C implementation? It's not a library, it compiles to a binary.

IMHO, the environment a utility is natively accessible from is more relevant than the language it was written in.

Same way you invoke a ruby script, python script, jvm call, whatever: with exec. Scripts are just the easiest.

That's a stretch. Consider "Swift implementation" v. "iPhone home screen implementation"

Maybe it's just that I'm hungry, but I find it strangely annoying.

So, anyone got any good HATETRIS replays? I'll edit my post if I find myself getting a good one. https://qntm.org/files/hatetris/hatetris.html

Edit: If you didn't look at the repo, this encoding was made to post HATETRIS replays on Twitter.

Edit: Only 3 points so far 𤆂𤆻𡚻𤆥㲺着遈𥮸㼉𤄛皲𤆻孈𤇆𡊾缎𓍌𤂻职𢪻郇膻𤅋𠅌傺𢊰䡪𤇄𤪤𡪻ꋇ𥆸𤶹膺𢡋聜𠆬𤪄膹𠬋㿄𠘬臀㾤冹𣾻𡈰𠭀䂹𤄔㼌𤚐𤢰𢢻𤇀𤞁䂺㬅𢉋𤮹㼆𣛄𡫀𤚒㡋𤢀ᖠ

I managed 4 by stacking vertically as quickly as possible and then trying to fill out horizontally.


Dang, I thought for sure I could get more than 4 with that tactic. ꉌ𤆻遜𣪻𤇇ꎷ𣹋郇ꎹ𠱋𣻅𤅡𡚻𤆻𤆒𠚻絀㬌ꊂ𣺻㣅𤮻𤪺𤂻𤇄肬𤆬𥆺𥆺𤪠𤄋𤆣𠚻胁𣚻𡫁𣚻𤇀𤙛𡚻𣛋𡛋𤇂㾺𣾔弌𣽋脧𡉫灛𤶻炼点冸𣈛䁛𠝋𠟋𤋁ᖠ

Me too! I thought I had it beat and then it threw me a square: 𤆌𡊻𣾻𤇋𤆌𤆼𣾌𠳋𤇅𡊻𤅻𠆬𡊻聩𤶸𡊻𣉻𠝻𤇁𡊻𤄻炬𡊻𤀻倜𤦸䂺𠨫𤇄𠚻𤇀𡊻𤅛邬𡊻𤁛𤇅邌邬𤚁𤆄䲼𤅑𠆜邌𤪀憸𡈋𣸛𠯁𣈛㰈𤴊𤮸𤚸𣢄𤿠𡧅𢘍

I got to 8 with a very similar tactic: 惃𣺻怌𢚻𤇂䂸𤀫䀌㼌䂹𢈫𤦀𤄫悸𤆔纹㴈聘𢚻𠆌𣚻𓋇𣾻𔗀𤊻ꈇ䂹𤅷𔒌膸𤅻羱繨𢘋傺𢪻𤆻𣺻𤄛𤆔㾺傺肻𤆻𤆡𤆤𤆐𤆑暌䂻𡘫𣯂𤃅𣿇𤅋𤆱𢙋傺𠆜𤢀𤆴𣺡𣫇𣢰𤾳

Not mine but the Github page has a replay that scores 30 points:


Crikey, I thought this was going to be a joke project, but it isn't. Is it..?

Either way, it's a neat piece of thinking.

It is a solution for posting HATETRIS scores; I think it is still a joke. Just a joke with a valid technical solution, and an above average level of dedication.

Maybe it's art? I'd say HATETRIS is.

If you really want to put binary data on Twitter, why not encode it in an image? You could probably get several tens of kilobytes of binary data reliably encoded in a JPEG of the maximum size Twitter allows.

Twitter seems to allow PNGs so might as well use that instead and not run into JPEG compression issues.

Twitter will mangle your PNGs into JPEGs unless they have an alpha channel.

Aren't JPEGs lossy? I can't think of a reliable and performant way of retrieving data from a JPEG.

On compression yes but you can craft a jpeg image which has exactly the pixel data you need. It will fail if someone recompresses your image, but that would be unlikely

Not unlikely at all, if you're going to post it on social media

Twitter recompresses images, as do most networks.

Two or three bits per pixel, grayscale, should be quick and pretty resilient. A hundred kilobytes or more is easy to transfer.
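The "two bits per pixel, grayscale" suggestion can be sketched without any image library. This is a simplified model under stated assumptions: it treats an image as a flat list of gray values and models recompression as small per-pixel noise. Real JPEG recompression works on 8x8 blocks, so a practical version would probably need larger cells per symbol. All names here are hypothetical.

```python
LEVELS = [0, 85, 170, 255]  # four gray levels = 2 bits per pixel


def bytes_to_pixels(data: bytes) -> list[int]:
    """Split each byte into four 2-bit symbols and map each one to a gray level."""
    pixels = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            pixels.append(LEVELS[(byte >> shift) & 0b11])
    return pixels


def pixels_to_bytes(pixels: list[int]) -> bytes:
    """Snap each pixel to the nearest gray level and reassemble the bytes."""
    out = bytearray()
    usable = len(pixels) - len(pixels) % 4  # ignore a trailing partial byte
    for i in range(0, usable, 4):
        byte = 0
        for p in pixels[i:i + 4]:
            symbol = min(3, max(0, round(p / 85)))
            byte = (byte << 2) | symbol
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    payload = b"HATETRIS"
    pixels = bytes_to_pixels(payload)
    # Simulate mild lossy damage: push every pixel 20 gray levels up or down.
    noisy = [p + 20 if i % 2 else p - 20 for i, p in enumerate(pixels)]
    print(pixels_to_bytes(noisy) == payload)
```

With levels spaced 85 apart, any per-pixel error smaller than about 42 gray levels still decodes correctly; using three bits per pixel would narrow that margin.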

Or, how about a QR code?

If you want absolutely terrible density, sure. Still beats 140 characters.

Neat. I see a lot of mention of Twitter but the first thing I thought of was packet compression. A ~50 byte packet shaves off around 20 bytes with this. Those are good savings although I haven't looked into the encoder / decoder enough to know if it's worth the tradeoff of having to translate every packet on both ends. I can also see UDP datagrams being a pain in the ass to work with when you're throwing around streams of Unicode characters.

Overall though, I like it and look forward to Base131072 being possible!

I didn't see any efficiencies that exceeded 100%. Are you counting code points rather than bytes in the unicode output size? The code points still have to be serialized as bytes after that is done, but since Twitter limits you by the number of characters, not the number of bytes, this amounts to a sort of compression of data into Twitter. On the wire it would still be an expansion of course.
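The difference between counting characters and counting bytes is easy to check. The sketch below uses a toy 16-bits-per-code-point encoding, which is NOT the real Base65536 alphabet (it simply maps each byte pair into U+20000..U+2FFFF), and compares it with standard Base64 for a 50-byte packet.

```python
import base64


def toy_encode16(data: bytes) -> str:
    """Toy 16-bits-per-code-point encoding (not the real Base65536 alphabet).

    Each pair of bytes becomes one code point in U+20000..U+2FFFF.
    Assumes an even number of input bytes to keep the sketch short.
    """
    if len(data) % 2:
        raise ValueError("toy encoder needs an even number of bytes")
    return "".join(
        chr(0x20000 + int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


def toy_decode16(text: str) -> bytes:
    """Reverse toy_encode16: one code point back into two bytes."""
    return b"".join((ord(ch) - 0x20000).to_bytes(2, "big") for ch in text)


if __name__ == "__main__":
    packet = bytes(range(50))
    b64 = base64.b64encode(packet).decode("ascii")
    toy = toy_encode16(packet)
    print("base64:", len(b64), "chars,", len(b64.encode("utf-8")), "UTF-8 bytes")
    print("toy16: ", len(toy), "chars,", len(toy.encode("utf-8")), "UTF-8 bytes")
```

The 16-bit scheme wins when the limit is counted in characters, but on the wire in UTF-8 it is larger than both the Base64 text and the original 50 bytes.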

For raw packets obviously not, but if:

- you're sending big byte arrays over json/xml

- you cannot switch to a more efficient medium

- (but you can make your remote counterpart adopt a different encoding)

- and you still need to maximize your throughput

then I guess you might consider base85 for utf-8 or base32k for utf-16?

JSON is UTF-8 by definition, so at best you'd be using a format that looked similar.

Oh yeah, good point (and one that I should have thought of before posting my comment). The scenario I described would actually be less efficient.

All those stats and one lingering question: what's the Weissman Score?

What do you propose as the standard for comparison? Base64?

Last year, I did a similar project: https://github.com/gvx/base116676

It had a feature where it automatically would try a couple of compression algorithms on the text to be able to cram even more into a single tweet.

I don't think it has a practical use, but it was fun to make.
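The "try a couple of compression algorithms and keep the best" step could look something like the sketch below. This is a guess at the general technique using Python's standard-library codecs, not the actual base116676 code; the one-byte tag scheme and the names `CODECS`, `pack`, and `unpack` are assumptions made for the example.

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte tags so the decoder knows which codec was chosen.
CODECS = {
    0: (lambda d: d, lambda d: d),  # stored as-is, for inputs that do not compress
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}


def pack(data: bytes) -> bytes:
    """Try every codec and keep the shortest result, prefixed with its tag."""
    candidates = [bytes([tag]) + comp(data) for tag, (comp, _) in CODECS.items()]
    return min(candidates, key=len)


def unpack(blob: bytes) -> bytes:
    """Read the tag byte and apply the matching decompressor."""
    _, decomp = CODECS[blob[0]]
    return decomp(blob[1:])


if __name__ == "__main__":
    text = b"the quick brown fox " * 20
    packed = pack(text)
    print(len(text), "->", len(packed), "bytes, codec tag", packed[0])
    print(unpack(packed) == text)
```

The packed bytes would then go through whatever text encoding is used for the tweet. Keeping an uncompressed option matters because very short inputs usually grow when compressed.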

Base 32768 has a very sexy 93.75% efficiency! Maybe I should use that with my browser game?

Going purely on the table on the linked github page, without further reading into anything at all, Base32768 seems like the clear winner for a go to scheme.

Only if your browser game requires UTF-16 strings.

More specifically: only if you will be storing/transmitting data in UTF-16. If you're using UTF-8, base85 is much better.

Javascript strings are UTF-16 (or maybe it was UCS-2?)

Javascript strings behave like UCS-2, but they can be stored in memory however the interpreter likes, and they're typically written to disk or the network as UTF-8.

Do not try playing this game. You're welcome.


The 30-rows replay is pretty impressive..

I wonder if it is mathematically optimal.

Someone spin up AlphaGo on this game

Looks like we should create some other 20k emoji.

"See a need, fill a need" (Bigweld).

Enantiomorphic tetris!

What, no Java?

a sad sign of our times... what nonsense.

What surprises me is that this encoding was developed to allow people to share replays of an illegal, and very pathological, Tetris variant. Hackers gonna hack.

> illegal, and very pathological

Pathological, certainly; "illegal"?

Most aspects of Tetris's appearance and gameplay are protected by copyrights and trademarks owned by The Tetris Company LLC. This includes trademarks on the shape of the tetrominoes, the suffix "-tris", and the use of the Russian folksong Korobeiniki in a video game.

In their copyright-infringement case against Xio Interactive, a judge ruled that aspects of the game such as the dimensions of the game board (which the HATETRIS developer took pains to replicate) are protected by copyright.

Tetris is one of the most aggressively defended game IPs out there. Any recognizable clone of it is potentially infringing.

There are thousands of clones out there, both for-profit and otherwise, and the vast majority have not been pursued. Trademarks on the shapes of "tetrominoes" are questionable (you can't trademark a purely functional element, though you could potentially trademark a particular stylization of it or use of it in a logo). Copyright on game rules is questionable as well, and whether it'd apply in any particular case would depend on both jurisdiction and any potential fair use claims; there's a long history of cloning games, without using any of the original art or other assets. Pretty much the only bit that's more clearly problematic is the use of "-tris" as a suffix; GNOME's version was renamed from "gnometris" to "quadrapassel" for exactly that reason.

I certainly wouldn't give such aggressive behavior any unwarranted credence by presumptively calling a clone "illegal"; on the contrary, I cheer on creative developments like this.

Your summary is what the developers of the iPhone app Mino thought (they were some friends of mine; I drew the graphics for the game, and got sucked into several hours of deposition for the case to explain how I had chosen the colors from first principles, etc.). But the judge ruled otherwise, as the GP poster described.

Theoretically game rules cannot be protected in this way, but if you draw the wrong judge, too bad for you. Unless you have several years to kill and the $lots required to take a case up through every level of appeals where you still might lose in the end, I recommend against building a for-profit Tetris-like game.

Atari v. Philips. Look it up.

Games which are "substantially similar" to other games infringe copyright. It doesn't matter if the assets and code are all original, etc. This is settled copyright law.

Already familiar with it; doesn't make it right, or justifiable. And there's plenty of case law in both directions on reverse engineering and cloning; the exact boundary would depend heavily on the details of a specific case.

In any case, it'd also be much harder to prove any harm caused by a variant like this, which is designed specifically to be un-fun, as difficult as possible, look nothing like the original, and not have any commercial interest at all. Which makes the presumption of illegality entirely inappropriate.
