Hacker News new | past | comments | ask | show | jobs | submit login
Obfuscated Tiny C Compiler (bellard.org)
381 points by ducktective 64 days ago | hide | past | favorite | 124 comments

I thought I could follow at least some of the unobsfucated C, and then this monstrosity:

    t = "++#m--%am*@R<^1c/@%[_[H3c%@%[_[H3c+@.B#d-@%:_^BKd<<Z/03e>>`/03e<=0f>=/f<@.f>@1f==&g!=\'g&&k||#l&@.BCh^@.BSi|@.B+j~@/%Yd!@&d*@b";
Could someone please explain what is going on in that routine?

Based on the unobfuscated source, this is a lookup table of some kind for operators, encoded in a string.

  ++   -> tokl=11 tokc=0x1
  --   -> tokl=11 tokc=0xff
  *    -> tokl=1 tokc=0xc1af
  /    -> tokl=1 tokc=0xf9f7
  %    -> tokl=1 tokc=0xf9f7
  +    -> tokl=2 tokc=0xc801
  -    -> tokl=2 tokc=0xd8f7
  <<   -> tokl=3 tokc=0xe0d3
  >>   -> tokl=3 tokc=0xf8d3
  !    -> tokl=2 tokc=0x4
  *    -> tokl=0 tokc=0x0
For example

  ^^       operator (padded with @ when one character)
    ^^^^   tokc (variable length)
        ^  tokl

T is being assigned a string value that looks like line noise. I assume it's Perl.

It's funny to see the Perl comparison when APL-family languages normally have code that looks like that:


Rust is more unreadable to me than Perl, in all fairness.

Haha. Thanks I needed that Monday morning laugh.

It's part of the tokenizer. t contains data in the format of [char 0][char 1][base 64 encoded token id …], whereas the id is as long as the entry is smaller than 'b'. So it's ('++', `#m`), ('--', `%am`), ('*@', 'R<^1c'), and so on. '@' is used as a wildcard. As you can see this part parses operators.

Looks like a type definition in Rust.

If it doesn't end with .unwrap() then it isn't Rust.

true :(

This made me smile :-)

It's looking for operators. That string encodes the operator to look for, and the values of tokl and tokc that will be used if that operator is found.

The first entry looks for "++", and if found sets tokc to 1 ('#' - 'b' + 64) and tokl to 11 ('m' - 'b').

The second entry looks for "--" and if found sets tokc to 255 and tokl to 11.

The third entry looks for "*" and if found sets tokc to 12693263 and tokl to 1. (12693263 is 0xc1af0f, which is an x86 IMUL instruction).

I would like to look at the code after preprocessing, it is too hard to analyze the code with all those #define statements.

First thought would be "opcodes", but on a closer look the operators in there seem to be relevant. Maybe this encodes their meaning/relevant opcode and precedence?

From the website:

``` binary operators, by decreasing priority order: '*' '/' '%', '+' '-', '>>' '<<', '<' '<=' '>' '>=', '==' '!=', '&', '^', '|', '=', '&&', '||'. ```

The unary operators ++ and -- have a higher precedence, so they come first. Then there are "*@R<^1c", "/@%[_[H3c", "%@%[_[H3c" (same as for '/', though), "+@.B#d" and so.

I did not have time to read the source (my compile finished a minute ago), but maybe you can now figure out the rest? :)

//edit: ha, edflsafoiewq and xashor we're a bit faster and more thorough. But nice to see we agree on the overall idea.

Isn't it obvious? /s

If I recall correctly the maximum binary data you can fit in the largest QR code is 2,953 bytes. This C compiler can fit on a qr code.

True, although "the largest QR code" is substantially more dense than most people have ever encountered: https://en.m.wikipedia.org/wiki/File:Qr-code-ver-40.svg . People might not realize that QR codes scale up and down very gracefully; most QR codes that people have ever seen probably encode less than 80 bits.

Reminds me of a project a friend and I took part in, sending each other source code via postal mail. I ended up sending a QR code printed on film negative [0] to pack about 1k of bytes. Bellard's code is way more "information-dense" of course..! :-)

[0] https://greg.technology/img/postal-codes.jpg

Edit: incredibly, pointing the iPhone camera app at the low-ish resolution photo of the QR code above works -- the QR code does decode. That's crazy.

This reminded me of Richard Feynman and his first wife driving military censors bonkers by sending each other puzzles.

That's an incredibly kind comparison. Thank you! Do you have a reference where Feynman talks about this? I'm finding this [0]:

"Arline entered the nearby Albuquerque sanatorium, from where she wrote him letters in code — for the sheer fun of it, because she knew how he cherished puzzles, but the correspondence alarmed the military censors at the laboratory’s Intelligence Office. Tasked with abating any breaches to the secrecy of the operation, they cautioned Feynman that coded messages were against the rules and demanded that his wife include a key in each letter to help them decipher it. This only amplified Arline’s sense of fun — she began cutting holes into her letters, covering passages with ink, and even mail-ordered a jigsaw puzzle kit with which to cut up the pages and completely confound the agents."

[0] https://www.brainpickings.org/2017/10/17/richard-feynman-arl...

He talks about it in more depth in Surely You're Joking, Mr Feynman, in the "Los Alamos From Below" section

Hiya, how did you print these on negative film?

In Montreal, the really really great people at Borealis [0] made it extremely easy. I sent them a 300 DPI file by email, and they printed negatives from those files in a few days. Each negative cost about $8 to print.

Finding a lab that did this online was pretty difficult at the time. If others have recommendations -- please chime in! Otherwise, I would strongly recommend considering Borealis (who I'm sure can ship the negatives by mail -- otherwise reach out and I can facilitate the mailing as I live here ha)

[0] http://www2.borealislab.qc.ca/borealis/en-home

You drill a hole in lead plate, take uranium rock, rasberry pi and...

Why were you doing this?

It was an art project -- called "Postal Codes" -- where an artist friend and I would correspond by creating sketches (typically using p5.js or three.js) and sending them by "snail" mail. There is a bit more information here [0], here [1] and here [2].

[0] https://www.instagram.com/p/B9MhMzKHVxJ/

[1] https://www.instagram.com/p/B7wL7MKly3k/ (description in Spanish and then English)

[2] https://github.com/gregsadetsky/postal-codes (project description, "rules", and source code)

Is the an annotated unobfuscated version ?

It started as that but it's since turned into completely usable C compiler that supports 32-bit and 64-bit x86 and ARM platforms. There are still developers adding to the project every day.



I absolutely adore TCC, fwiw. It’s the base project that this one is derived from. Like all of bellard’s work, this is legit and also extraordinarily high quality. At one point before I got bored I basically had a game engine that took C source code as scripting inputs. Totally stupid, and worked great thanks to this.

> [TCC is] the base project that this one is derived from

The relationship is actually the other way around, at least according to this page:

"NOTE: My other project TinyCC which is a fully featured ISOC99 C compiler was written by starting from the source code of OTCC"

Oh! Thank you for the correction!

I wasn't sure which was which, but TCC is so lovely. It's cool to hear some of its history.

Speaking of mad genius ideas, my mind was blown by https://bellard.org/tcc/tccboot.html a boot loader able to compile and boot a Linux kernel directly from source code.

I love his JSLinux that allows you to boot Linux (or DOS or Windows!) in your browser. https://bellard.org/jslinux/

What do you think about chibicc in comparison? It supports C11.


Neat! I haven't tried it, but it looks interesting. Thanks.

I wonder does Fabrice hang-out on HN, Reddit or some chat channels?

I very much like to know his views on topics like Rust, RISC-V, Blockchain etc... I'm curious about his workflow, dev. machine, preferred distros, editors and the like :D

An AMA or a technical interview done by someone knowledgeable in the field would be awesome!

He has his own emacs clone named QEmacs[1] which I think he uses personally. He uses Subversion as his version control system (at least in developing QuickJS) according to this post[2] by him in quickjs-devel mailing list.

[1]: https://bellard.org/qemacs/

[2]: https://www.freelists.org/post/quickjs-devel/Please-consider...

Thanks! Good to know! I knew about his Emacs clone...I think Linus Torvalds uses something similar.

Linus has his own port of the venerable MicroEmacs[1]. As does Walter Bright[2], by the way.

qemacs is actually quite advanced, and even includes a basic HTML renderer (for the help). I was half joking that with quickjs, one now can make a reasonably modern web browser out of bellard-ware…

[1]: https://github.com/torvalds/uemacs

[2]: https://github.com/DigitalMars/med

If you find this level of density interesting, it may be illuminating to know that there are programmers who work with code that is as dense if not more so, every day as part of their job, not even trying to be dense or win the IOCCC.

Of course, I am referring to the users of APL family languages:



If you are not ready for APLs, then you can inspect this, not "obfuscated", but still dense, self-compiling C-subset JIT:


To be personally Fabrice Bellard is the Leonardo da Vinci of technology with specialization in C.

This guy knows how to push the language's limits and he is the main reason I switched back to C for good.

*to ME...ugh, all these typos lately /facepalm...

I really like this kind of minimal programming. If you like this check out /r/tinycode

When I look at Knuth code snippets in CWEB I get the same vibes. He has a full implementation of garbage collecting zero suppressed binary decision diagrams that he used to count various stuff and it’s insanely elegant and minimal.

The way I approach programming in general is to put a ton of thought in up front about how to do the least amount of work possible. Is there a very simple elegant way to do this? Can I collapse it down even more? Could I do it in just one or a few source code files of reasonable size?

For some reason the thought that I will never be as productive as himself brings me peace. Huge respect. I remember looking at his entry in 2002 like it was yesterday, that was a pleasurable mind fk.

Fabrice is a genius... how he manages to build all such stuff he builds?

Lack of distractions, doesn't care about fashion, intimate familiarity with tools and deep mathematical insight combined with hardcore coding skills.

I'm sorry if this is a stupid question, but why wouldn't the non-obfuscated versions be able to compile themselves?

It's literally explained in the first line on the page, where he says he wrote it "in order to win the International Obfuscated C Code Contest (IOCCC) in 2002."

So it's not that it wouldn't work, but it obviously wouldn't have worked for that contest.

Without having looked at the code I'm guessing he used unimplemented language features in the non-obfuscated version, then re-wrote to avoid using them in the obfuscated version.

TCC, the actually useful and non-obfuscated version can compile itself of course.


But the fact a C compiler can compile itself is not very impressive. Making it fit into 2048 bytes of obfuscated code is.

I wonder if TCC could be used in a JIT.

I did in my music/generative art app, where I compiled math expressions into native code using TCC as a backend. Actually, it was _so_ much easier than dynasm and other JIT libraries, and so little code, and yet, it was a great performance boost already.

Yes, it's an officially supported use case.

I knew Bellard had written a very functional Tiny C compiler. I had no idea he wrote it more than once. My respect for him just continues to grow.

Many years, and still impressing. I wonder if similar projects exist for other languages (preferably, without a use of `eval()` and friends).

I wonder if someone already modified this so it also runs on Windows & Mac.


is it practical in any way?

Suppose you want to defeat a "Reflections on Trusting Trust" attack and bootstrap a toolchain source code you believe is clean to produce binaries that are clean. You need to compile llvm/clang (or gcc). You start with something small you can audit manually, use it to build binaries that can compile llvm, then use llvm to build itself to get an optimized compiler that doesn't have malicious code to recognize the password check inside the login binary and emit machine code that is not faithful to the source code.

Interesting! Still, I believe this doesn't prevent malfunction resulting from a compromised silicon...Like a sequence of instruction executions or memory accesses resulting to sending over a packet via the network.

I wonder how can one prevent that? Even an open-source ISA could not be fully trusted since one can't know what is implemented on the actual die on hand.

It's true that the silicon is hard to audit, but it's also pretty static, and there's a huge variety of pretty effective obfuscation that can live on top of the silicon. The

You suggest that an adversary could use some compromised layer to send a packet over the network when something happens. For example, someone running `aesenclast xmm15, xmm10` could also secretly trigger a socket()/send()/close() syscall set that would transmit a packet over the network adapter. But what if the target code never calls the AES-NI instructions? A user could easily be using a library that processes the encryption in software; a simple ISA filter has an impossible task to determine which xor instructions are cryptographic and which are just ordinary compiler output. You could be running a virtual machine, in which you've loaded a browser Javascript engine, and using that Javascript engine to run a virtual machine (https://bellard.org/jslinux/), in which you're running a Python instance, which is finally performing the encryption.

Sure, if all those layers were efficient, they'd just pass it down the stack to eventually call `aesenc`, but modern software is anything but efficient. Yes, if your deeply nested Python calls "sum = num1 + num2", the x86 `add` instruction likely gets invoked somewhere, but to predict when some client code is handling a secret key or when some user input is actually a password seems really difficult.

tl;dr of a pamphlet I decided not to post: If you're [seriously] considering anything beyond "my firewall is strictly unable to report it", you're either building something like nuclear weapons or taking the first steps to sliding down the rabbit hole of paranoia, ending up in tinfoil-hat-wonderland ;-)

The topic is interesting from an academic point of view. Maybe something along "On trustworthy, verifiable execution on untrusted, adversary hardware". Like homomorphic encryption, but I think having the user supply the input would make for a difficult programming+debugging cycle :P

Anyway, the goal of this compiler here isn't to prevent ANY kind of attack. As the page says, it is "only" a tiny C compiler, and an obfuscated one at that. The "tiny" part makes it much easier to audit than say GCC or clang (and you don't need to audit GNU binutils, either - and I've seen that code, so trust me when I say: You don't want to go that route).

//edit: Obviously I don't want to insult the fine folks investing their time into projects such as binutils, they deserve quite some praise - but as someone who is not familiar with the code base I found it to be very overwhelming.

How do you know that your text editors aren't compiled to hide the payload when examining this compiler, thus perpetuating the hidenness?

If you wanted to solve for that, you would have to build a computer like the monster6502 [1] to run the compiler on, and visually inspect the memory contents.

1: https://monster6502.com/

How do you know the person who designed that isn't in on it?

The thing is pretty easy to verify.

The ultimate question is: How far do you need to go until you believe that the earth is indeed not flat? Do you trust your teachers? Stand on a tall building? Take a balloon ride? A plane? A spaceship? Then still, at which point can you rule out that the earth just appears to be a ball due to some yet-to-be-explained phenomenon (gravitational anomaly?) but is indeed flat? (The correct answer of course is "turtles all the way down").

For some definition of 'easy', yes. You still need to know a lot about how a CPU works. I think a project to bootstrap a PC with code generated from a completely verifiable platform would be really cool, but in practise would be of limited usefulness above and beyond secure boot we already have, because you're still trusting all the binary BLOBs you can't control (BIOS/UEFI, Intel ME, microcode, etc)

At one point a variation of it could compile a Linux kernel (very fast too!) and boot from the same ISO. Of course GCC or LLVM produced better optimised code, but TCC, being so small runs so much faster.

Another practical application is using tinycc as a backend compiler for Nim [1]. I set up my nim.cfg to default to this for a rapid edit-compile-test cycle (usually under 250 millisec), with a quick define switch to move to gcc-optimized code.

Similar is likely possible for other prog.langs that emit C.

One helpful feature for the full round-trip to an executable file is a built-in object file linker.

Also, libtcc can be used as a library to "compile a string" and then run it which is the JIT application mentioned elsewhere in this thread.

[1] https://nim-lang.org

Yes, for winning IOCCC competitions! Which would normally be a great CV badge to demonstrate your skills to prospective employers, but likely of little consequence if you're Fabrice.

It's probably of little consequence for his employment, but Bellard has won a few IOCCC prices: https://www.ioccc.org/winners.html#B

> Which would normally be a great CV badge to demonstrate your skills to prospective employers

Not many employers would know of the IOCCC.

If you have the sort of brain that makes winning IOCCC entries, it is highly probable that a candidate employer who has not heard of it is not, in fact, prospective.

> Not many employers would know of the IOCCC.

Sure. But who wants to work flipping burgers, greeting shoppers, or doing web front-ends for the rest of their life?

Anyone interviewing an IOCCC winner probably has.

can confirm

source: won the IOCCC

Android once included a library "libacc", (the Android "Almost" C Compiler") which was based on the deobfuscated version of this.


It isn't as it compiles a subset of c. Nonetheless it's a gem!

He also created QuickJS, that's javascript server side, like NodeJS but much faster to start programs. Just type "qjs your-script.js". I use it in my server because of its low overhead to start and run my microservices. https://bellard.org/quickjs/

And qemu... and ffmpeg.

The productivity of Bellard is truly admirable.

And around a week ago, his text compressor topped the enwik9 (Hutter Prize) benchmark. Not prize-eligible though-- that would require a single core, non-GPU build to run in <100h. His nncp 2.1 benchmark took 140h to run on a GTX 3090.



I'm happy for him.

I also wonder his employer might be the nicest do-whatever-you-want-in-your-free-time employer on earth.

Not only that, I wonder what kind of day-job roles he has since I assume there is significant cognitive load in his personal projects (we still think he's a human not a superhuman right?).

For context, my last employer was "Oh you're doing something in your free time? let me utterly destroy you." (and they did).

No worries on that front: https://www.amarisoft.com/about-us/

Founded in 2012. That's like yesterday. 80% of his stuff was done by then.

edit: Even then, now he probably has the world's nicest co-founder instead. Who does cognitively demanding open source software projects made available for free online, while running a startup?

> For context, my last employer was "Oh you're doing something in your free time? let me utterly destroy you." (and they did).

How could they? And what was their motivation for doing this?

An employee doing something in their free time is a rare and valuable asset.

Boy I got news for you.

They also had a policy that everyone should spend at least 4 hours every week (outside of the 40 hour work-week) doing some personal project, or learning. The more the better. (it was a recommendation, not a requirement, and there was no monitoring).

Well what I had in mind was a bit more ambitious than four hours. It did not infringe on their bottom-line (I wasn't using their tech or competing with my day-job; it was completely unrelated). But for me they were like "Oh, no, no ... we were just kidding. We really didn't mean that you should do something in your free time".

The encouragement to do something in free-time is a lie! ... they know most of the engineers don't do that. But if someone actually does something, then the lie gets exposed.

Try it yourself. Take them up on their offer and see what happens.

I don't work anymore, but I'm curious of what happened.

You said they destroyed you.

Did they fire you for working on your project?

Were you harassed? (tall poppy syndrome?)

I'm not going to open a can of worms here but lemme suffice by saying this:

With success comes maturity, and a startup turns into a corrupt, org-chart, white-elephant company. 90% of the employees end up being cronies that are pushing paper. The remaining 10% are the real engine of the company (cz the revenue stream has to be maintained somehow and real work still needs to be done). But those 10% need to be kept under the thumb at all cost, made to feel they're not good enough no matter what they do; otherwise they might start questioning wtf's going on in the rest of the company. Let's just say that I was one of those 10%. They did whatever they had to do to make an example out of me. I got my freedom, but I actually feel worse for the other hard-workers who were left behind.

Is this SV? What was the field of that startup?

He worked on ffmpeg under a pseudonym. He may very well be secretly involved in other projects.

Worked on might be an understatement, given that he was the initial ffmpeg developer, and thus founded that project.

There have been many more ffmpeg developers since then. I think it was originally a mpeg1/2/4 codec only, which is something one person can do. Although hardly anyone has done it with acceptable performance like he did, usually they like to get abstract and object-oriented for no actual benefit.

damn that’s ballin, I didn’t even know people did that with code

jslinux as well, it's more of a cool demo than anything useful but the "cool" factor is super high. And led to tinyemu (which underlies the current jslinux implementation). The current version can run Windows NT inside QEMU inside JSLinux. In your browser.

Bellard is one of those 1%’ers.

Only 1%? At least one per million, if not per billion (depending on how you count the denominator).

He's not just 10x, he's 0x10x.

You’re right. Let’s say he’s a 1%’er under a linear view of a logarithmic ranking of all programmers.


I wonder if he's lonely, with not too many people in the world who can communicate, operate, execute at his level...

That reminds me, Edward Teller said "von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us."

QuickJS Main Features:

* Small and easily embeddable: just a few C files, no external dependency, 210 KiB of x86 code for a simple hello world program.

* Fast interpreter with very low startup time: runs the 75000 tests of the ECMAScript Test Suite in about 100 seconds on a single core of a desktop PC. The complete life cycle of a runtime instance completes in less than 300 microseconds.

* Almost complete ES2020 support including modules, asynchronous generators and full Annex B support (legacy web compatibility).

* Passes nearly 100% of the ECMAScript Test Suite tests when selecting the ES2020 features. A summary is available at Test262 Report.

* Can compile Javascript sources to executables with no external dependency.

I hadn't seen QuickJS before and it looks amazing. It's a small, fast implementation of JS plus wrappers for the C standard library. That standard library includes output, reading and writing files, and a lot more. It's both an interpreter and a compiler. So you can build small console apps in JS.

Do you run your microservices as CGIs or from inetd? Or why does the start overhead matter that much?

bellard is a programming genius and true C hacker. his works are so great but how humble is his website! why doesn't he write blog posts? why he has so little online presence? hmm. let us wish him a long and productive life.

>> why doesn't he write blog posts? why he has so little online presence?

>> bellard is a programming genius and true C hacker. his works are so great ...

It's probably related. Obviously it's his and only his decision, but I wonder if him slowing down to share the knowledge and skills could have been more productive in the sense that others could do similar work.

Bellard is famous for starting a number of projects that have since been carried on by others.

Looking at his output some time ago, one of the main things that I found inspiring was how much good he did by sharing his work in a form that others could build on it.

Complaining that he's done insufficient to bring others on seems really unfair.

I wonder if we could regain the level of intellectual maturity of the 19th and 20th century and not criticize those who already give away so much that they are not working hard enough.

This is literally Animal Farm / team lead "multiplier" nonsense.

And like in Animal Farm, the pigs profit.

He shares the demystified code here:

https://bellard.org/otcc/otccn.c the TCC documentation sheds further light: https://bellard.org/tcc/tcc-doc.html#devel

He is concentrating on being creative = create = work.

Valley startup hipster culture doesn't get this.

Found this article that seems to talk more about him than his own website.


"Bellard doesn’t appear to promote himself—he politely declined to be interviewed for this profile"

Self promotion is an entirely separate hobby from programming. Most people don't blog about their hobbies.

I second this

I always check the topics on Bellard here on HN, really insightful.

Let's don't forget the obligatory: "Can he invert a binary tree on a whiteboard though?"

The funniest thing about the 'invert a binary tree' meme is that it genuinely is a simple, 5-liner problem that most programmers could solve in five minutes, it just sounds intimidating.

> Let's don't forget the obligatory: "Can he invert a binary tree on a whiteboard though?"

Probably yes.

Fabrice Bellard should receive a Turing award. Who agrees?

I don't see the point. But Turing would have received the Bellard award, in a saner timeline.

Also, in a saner timeline, Turing might have well received the Fields medal.

His contributions to probability theory, and morphogenesis are amazing in addition to being the father of computer science, and codebreaker. All of the work was done before he was 40.

[1] https://www.jstor.org/stable/2974762?seq=1#metadata_info_tab...

Maybe the big unicorns who surely got hefty profits off of his work should pool together a billion dollar reward for him.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact