Anguish: Invisible Programming Language and Invisible Data Theft (perl.org)
225 points by labster on May 20, 2016 | 59 comments

Unicode has stubbornly refused to become Turing complete. Will it ever be more than a bunch of character tables?

I think Unicode Emoji were a great step forward but we must redouble our efforts.

Unicode could be a stack machine. Implementing bi-directional text already requires a stack which can be pushed onto and popped from (https://en.wikipedia.org/wiki/Bi-directional_text#Unicode_bi...); we just need to add more operations!

Fascinating. The emoji seem primitive in comparison.

This is actually related to something I'm working on at the moment and it cleared up a few misconceptions. So thanks for the link :-)

A Turing machine can be simulated with two stacks.
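To make the two-stack claim concrete, here is a rough Python sketch of just the tape part: one stack holds the cells to the left of the head, the other holds the head cell and everything to its right. The class name `TwoStackTape` and the `_` blank symbol are made up for illustration; a full machine would add a state table on top of this.

```python
class TwoStackTape:
    """Turing-machine tape built from two stacks (plain Python lists)."""

    def __init__(self, cells="", blank="_"):
        self.blank = blank
        self.left = []                       # cells left of the head; top = nearest cell
        self.right = list(reversed(cells))   # top of this stack is the cell under the head

    def read(self):
        # An empty right stack means the head sits on untouched (blank) tape.
        return self.right[-1] if self.right else self.blank

    def write(self, symbol):
        if self.right:
            self.right[-1] = symbol
        else:
            self.right.append(symbol)

    def move_right(self):
        # Pop the head cell off the right stack and push it onto the left stack.
        self.left.append(self.right.pop() if self.right else self.blank)

    def move_left(self):
        # Pop the nearest left cell and push it onto the right stack.
        self.right.append(self.left.pop() if self.left else self.blank)
```

Every head movement is one pop plus one push, which is why two stacks are enough to stand in for an unbounded tape.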

I genuinely can't decide whether or not this is sarcasm.

It's sarcasm.

Someone needs to make a script which runs through a git repo's commit history and looks for commits which add invisible Unicode characters. Maybe some existing exploits could be found in the wild.

Here's a script that detects the Anguish characters (except for the Byte Order Mark \ufeff, which isn't that rare):

  git log -p --pickaxe-regex -S"$exp"
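For anyone who would rather not depend on a shell regex, here is a rough Python sketch of the same kind of check. It is not the pattern from the script above: it assumes "invisible" can be approximated by Unicode general category Cf (format characters), and `find_invisible` is a made-up name. Like the script, it skips U+FEFF.

```python
import unicodedata

def find_invisible(text):
    """Return (line, column, 'U+XXXX', name) for each format character found.

    Assumption: general category Cf is a reasonable proxy for "invisible";
    this is broader than the specific set of characters Anguish uses.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Skip the byte order mark, which is too common to be a useful signal.
            if unicodedata.category(ch) == "Cf" and ch != "\ufeff":
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits
```

Running this over the text of `git log -p` (read from a pipe or a file) would list the line and column of each suspicious character.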

Some popular invisible characters are space, tab, and newline.

None of which are invisible in the same sense as those covered in the article...

I am responding to the parent comment, not to the article.

Spaces, tabs, and newlines can be used to alter code in an invisible way.

Lua, or at least LuaJIT, allows these characters to be used as identifiers, which has led to some pretty interesting-looking obfuscated code: https://facepunch.com/showthread.php?t=1463260&p=47712658&vi...

The following code uses this technique. It's valid Lua. If you highlight this and copy it to your system clipboard, running `pbpaste | luajit` will count to 5 three times:


      for ‎‏=1,5 do


Hmmm, running pasted code that is explicitly obfuscated from the internet probably isn't a great idea.

True, but in this case, the invisible characters are simply names and the rest is verifiably benign code. It's not like we're copy-pasting a binary or anything.

I've been thinking of ways to make it seem like verifiably benign code, while doing something "interesting."

For example:

This is benign, but that's not an empty string. The string contains a bunch of U+200e and U+200f characters, even though it appears empty. It proves that you can have strings with invisible characters in them.

Since we have two types of invisible characters, U+200e and U+200f, we can use those as binary digits -- 1 and 0. Thus, we can write a function that takes an invisible string as input, and returns a normal string as output.
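A minimal sketch of that encoder/decoder pair, written in Python rather than Lua so it is easy to try. The bit assignment (U+200E = 1, U+200F = 0), the UTF-8 byte encoding, and the names `hide` and `reveal` are all assumptions for illustration, not the original poster's code.

```python
ONE = "\u200e"   # LEFT-TO-RIGHT MARK, assumed to stand for binary 1
ZERO = "\u200f"  # RIGHT-TO-LEFT MARK, assumed to stand for binary 0

def hide(text):
    """Encode text as a string made only of U+200E / U+200F characters."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)

def reveal(invisible):
    """Decode a string produced by hide(); other characters are ignored."""
    bits = "".join("1" if ch == ONE else "0" for ch in invisible if ch in (ONE, ZERO))
    usable = len(bits) - len(bits) % 8   # drop any incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8")
```

Each visible byte costs eight invisible characters, so `hide("A")` is eight characters long yet renders as nothing in most fonts.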

So, what kind of string could we feed it? One possibility would be to convert something like "echo 'command-line injection'" into an invisible string. We'd pass that into our decoder function, and pass the result into os.execute. Since the conversion function mentioned above can be identified with an invisible variable name, it would look similar to this:

That looks very suspicious, but we can do better. In Lua, you can index into tables with strings. And we have a function which can take invisible strings and produce normal strings.

The final PoC could look similar to this:

That's non-working code, as I haven't put this together. But the idea is to convert the following (working) code into invisible strings:

  _G["os"]["execute"]("echo 'command-line injection'")

Making this work is left as an exercise for the reader. :)

Another interesting approach would be to iterate through the "os" table a fixed number of times, until reaching the "execute" key. The iteration order isn't guaranteed, but given a certain version of LuaJIT, I think it's stable. That means you'd be able to do the equivalent of "os.execute" while making it look like you're "counting to 5."

Indeed not, but it's pretty amazing that it's copy-pastable via HN.

Ignorable (invisible) Unicode characters have caused security vulnerabilities in the past, especially on HFS+ filesystems running on OS X (due to normalization):

* https://git-blame.blogspot.com.es/2014/12/git-1856-195-205-2...

* https://www.cvedetails.com/cve/CVE-2013-0966/

The language is perfect for literate programming. All human-readable characters are comments by default, so you can write completely human-friendly prose, and make the machine-readable content invisible to humans.

Or you could put the comments in the invisible fork and pass a Turing test.

'I passed the Turing test. No one believed me. Honest.'

passes Turing test by sounding like a petulant child

It might be useful to have an option to turn all the "non-obvious" Unicode characters into the form of <U+XXXX> in programming editors. One concern is that some legitimate text will break, but it would be worth it considering most code is written in English anyways.
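Roughly what such an editor option might do, as a Python sketch. Which characters count as "non-obvious" is a judgment call; the choice here (format characters, line/paragraph separators, and any space other than U+0020) and the name `show_hidden` are assumptions.

```python
import unicodedata

def show_hidden(text):
    """Replace hard-to-see characters with a visible <U+XXXX> placeholder."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Cf: format characters (zero-width space, direction marks, BOM, ...)
        # Zl/Zp: line and paragraph separators
        # Zs: space separators, except the ordinary ASCII space
        if cat in ("Cf", "Zl", "Zp") or (cat == "Zs" and ch != " "):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)
```

Ordinary accented letters pass through untouched, so text such as "café" is not broken by this particular filter.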

It's trivial to convert between this and regular Brainfuck syntax: just replace characters. The article even gives an example using a Perl one-liner :)

Revision: I believe I misinterpreted the intention of your post, instead wanting to expose tricks like these. I'd be fine with this.

Jetbrains: "Zero Width Characters locator" plugin (https://plugins.jetbrains.com/plugin/7448)

This is the next generation of shell code right there.

Why try to obfuscate programs in base64-encoded strings when you can have them lying around invisibly in plain sight?

Here's one I made earlier:

    alert(String.fromCharCode.apply(null,String.fromCharCode.apply(null,"​‌​​‌​‌​​‌‌‌​‌​‌​‌‌‌​​‌‌​‌‌‌​‌​​​​‌​​​​​​‌‌​​​​‌​‌‌​‌‌‌​​‌‌​‌‌‌‌​‌‌‌​‌​​​‌‌​‌​​​​‌‌​​‌​‌​‌‌‌​​‌​​​‌​​​​​​‌​​‌​‌​​‌‌​​​​‌​‌‌‌​‌‌​​‌‌​​​​‌​‌​‌​​‌‌​‌‌​​​‌‌​‌‌‌​​‌​​‌‌​‌​​‌​‌‌‌​​​​​‌‌‌​‌​​​​‌​​​​​​‌‌​‌​​​​‌‌​​​​‌​‌‌​​​‌‌​‌‌​‌​‌‌​‌‌​​‌​‌​‌‌‌​​‌​​​‌​​​​‌".split("").map(function(c){return(c.charCodeAt(0)>>2)^2098})).match(/.{8}/g).map(function(c){return parseInt(c,2)})))

Hmm. Readability is high on this one.

Yes, very transparent!

It also has the interesting property that all the programs are quines except those that print themselves.

In other words, the programs are quines if and only if they aren't.

Androids dream of quined Anguish.

Wow, that's truly evil. Would other languages like Ruby that support overloads like that be susceptible (I'm no Ruby expert)?

You can give methods names made of invisible Unicode characters, so calling them is basically invisible. Quick example:


You don't need to use define_method, it just makes it more obvious what's going on.

It uses "U+FEFF ZERO WIDTH NO-BREAK SPACE", also known as "BYTE ORDER MARK" -- which means that an Anguish program that starts with that character might not survive translation to or from UTF-16.
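A quick way to see the problem, sketched in Python. The one-character program body `+` is a placeholder, not real Anguish; the point is only what happens to a leading U+FEFF when a decoder takes it for a byte order mark.

```python
# Hypothetical program text whose first character is U+FEFF.
program = "\ufeff+"

# Encode as UTF-16 little-endian; this codec does not prepend a BOM of its own.
utf16_le = program.encode("utf-16-le")

# The generic "utf-16" decoder sees the leading FF FE bytes, treats them as a
# byte order mark, and drops them.
roundtrip = utf16_le.decode("utf-16")

# The same loss happens with a UTF-8 decoder that strips a leading BOM.
via_utf8_sig = program.encode("utf-8").decode("utf-8-sig")

print(len(program), len(roundtrip), len(via_utf8_sig))
```

In both cases the first "instruction" is silently dropped, because the decoder cannot distinguish a BOM from a deliberate U+FEFF at the start of the text.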

At first glance, I'd consider it a (security) bug in Perl 6 that it permits tokens containing invisible characters, let alone consisting solely of invisible characters. Are there any other languages with this behavior?

As a random example:

    titan:~ geofft$ python3 -c "$(printf "\u2063") = 1"
      File "<string>", line 1
         = 1
    SyntaxError: invalid character in identifier

If you change it to e.g. 00e9 ("é"), Python 3 permits the character, so it's not just a lower-ASCII thing.
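The same check can be done without invoking the parser, using `str.isidentifier()`, which applies Python 3's own identifier rules. A small sketch; the `results` dict is just for illustration:

```python
import unicodedata

# The two characters discussed above.
candidates = ["\u2063", "\u00e9"]

# Map each character to whether Python 3 would accept it as an identifier.
results = {ch: ch.isidentifier() for ch in candidates}

for ch, ok in results.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: valid identifier = {ok}")
```

U+2063 is a format character (category Cf) and is rejected, while U+00E9 is a letter and is accepted.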

https://rt.perl.org/Public/Bug/Display.html?id=128159 was filed by the author of this piece. The take-home to me is that supporting Unicode to a great depth in a language is really hard.

My company's large Java codebase has hundreds of zero-width spaces in the middle of method names. It's sad.

What's the reason? Did the method names need to be aligned somehow?

"invisible" is entirely dependent on your text editor

What about carriage return, newline, form feed, etc.? Those are invisible characters, and those are just plain ASCII.

Those aren't invisible, merely transparent: they take up space, and alter the location of the cursor when inserted. Invisible characters don't.

Also, some text-displaying programs will insert glyphs for them, to make them visible, which would make them somewhat more detectable.

And the BEL character, while non-spacing and invisible, is sometimes audible.

Those aren't valid tokens. "<newline> = 10" won't assign to a variable named <newline>. If the language wants to parse non-ASCII invisible characters as whitespace, or permit them inside comments or strings, that's fine.

git makes Anguish code look more readable than the rest of the file: http://imgur.com/AHavQor

This looks like vi, which is probably the default $EDITOR that git defers to.

This is delightfully evil

"Hush" immediately came to mind as a useful name for an invisible language.

This is probably a stupid question, but could somebody explain to me why one would add invisible characters to Unicode?

For instance, zero-width spaces and other word-break characters help produce reasonable text layout, but should be invisible. RTL and LTR marks help render text of different directionality, but obviously need to be invisible.


Because a character encoding standard is considered worthless if you can't make ANSI bombs with it...

Zero-width Unicode chars have been used in exploit kits for a while now; just use hd (or something similar) when debugging.

Where is this more dangerous, on the web or in Github and open source programming?

Buries the lede. The interesting part is the abuse of invisible characters to sneak malicious code into pull requests.

The language is just a cute transliteration of brainfuck to use invisible zero width characters.

Yeah, I was like 'well, that's kinda amusing' and then I saw the screenshot of the diff and went 'AHHHHH!!!' The worst part is, it's totally plausible. 'What's with the whitespace changes in your patch?' 'You had some trailing whitespace in there so my text editor automatically cleaned it up.'

Diff views probably should replace invisible characters with visible place-holder glyphs. Is that something that can be done on a font level, or does it require extra code? As in, can I assign a glyph to an RTL mark and have it automatically show up?

Since the article's title doesn't have that weakness and should have been used anyhow, we'll use it. (Submitted title was "Anguish: A language written in zero-width characters".)

Anyone who uses 'lede' correctly gets express treatment here...

I chose that title because "invisible" has a lot of meanings and "zero-width" is more precise here. Brainfuck is an invisible language -- and so is COBOL because so few people think about its widespread use today.

True, but your title leaves out the important part, the data theft POC.

> Anyone who uses 'lede' correctly gets express treatment here...

is that a hint of weariness? :3

This is nothing more than a string replace on top of Brainfuck.

Only in the first half of the article. The real reason for the post is in the second half, which is the actual scary part that could affect developers.
