Actually Portable Executable (justine.lol)
644 points by NilsIRL 42 days ago | 162 comments

The author is less than enthusiastic about Apple and Microsoft pivoting to ARM. Considering the perf of the M1, this is virtually inevitable. And once most developer tool chains start supporting ARM as a first class citizen, I see no reason why we wouldn’t start running our applications on ARM in the cloud. A world with 2 architectures for mainstream use cases is the future, there’s no point fighting it.

(Unless you’re Intel/AMD in which case please fight it by giving us faster, more power-efficient chips for cheaper. Thanks!)

We achieved a near ubiquitous consensus with the x86 PC. Then APPLE said, Behold, the programmers are one, and they can build portable binaries with one machine language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do. Go to, let us go down, and there confound their machine code, that they may not run each other's apps across platforms. So APPLE scattered them abroad with M1 processors from thence upon the face of all the Internet: and they left off to rebuild their open source.

Just because it's ubiquitous doesn't mean that it's good. Also, to be clear, x86 then became x86_64/amd64 which isn't the same architecture either. There will always be iterations and oddball architectures where something new can be learned and even reapplied to update x86. POWER, Sparc, etc. all taught new lessons.

Apple isn't scattering anything by running processors that run the same architecture as android and ios on phones. Most open source software can already be compiled for x86, arm, sparc, etc.

Except ARM CPUs have been vastly outselling x86 CPUs for a very long time, long before the M1 entered the scene. In just Q4 2020, 6.7 billion ARM-based devices shipped, while 275 million PCs shipped in all of 2020. Desktop PCs are only a small fraction of the total computing ecosystem.

The stuck-in-the-'90s "desktop is all there is" mindset is a weird holdover from the early growth of PCs in developed countries. If you look at emerging markets, mobile is completely dominant.

Raw sales numbers are going to be biased because ARM is like Zerg and x86 is Protoss. In the ARM world there isn't the same concept of a central processor so normally lots of chips get built into each individual device.

ARM has also been historically used most often on proprietary systems you need authorization to develop for. So it's made less sense as a target for open source tooling hack projects like this one.

ARM also has so many sub-targets that it's almost like a coalition of ISAs rather than a unified one like x86. So adding ARM support to Actually Portable Executable might not be as simple as including an ARM build in the binary. We might need to have multiple ARM builds for its microarchitectures. Because ARM users want resource efficiency and they're not going to be happy with a generalized build that broadly targets ARM; they want code that's narrowly targeted to the specific revisions of the processor that they're using.

In other words, we can't give ARM users portable binaries because ARM users do not want them.

I also always thought that code for other architectures was the kind of thing that mostly got contributed by the people who build those architectures. Things like how IBM always graces our GitHub issues with patches each time our code doesn't work on s390x mainframes. I like that they do it by contributing patches rather than the feedback of why don't you support this? Why don't you support that? Oh I didn't say I actually needed it.

> We might need to have multiple ARM builds for its microarchitectures.

Nitpick: by definition, you don't need different builds for different microarchitectures (except maybe for performance). If the same code doesn't work on different chips, it's because they have different ('versions of') instruction set architectures.

Edit: nevermind, on rereading, you were already complaining about that. You should probably stick scare-quotes on "'need'", though.

> they're not going to be happy with a generalized build that broadly targets ARM; they want code that's narrowly targeted to the specific revisions of the processor that they're using.

Apple already does this for x86: macOS contains duplicates of all operating system binaries & libraries compiled for pre- vs. post-Haswell processors.

Hahaha, I love the StarCraft analogy!

I really want to know how we can work Banelings into a hilariously extended metaphor here. :P

(Rule 5: if you haven't played SC2, Banelings were a super in-character addition to the zerg unit tree. Zerglings are able to morph into a variant called the "baneling" that's a fairly powerful suicide bomb. Rather like the "scourge" air units in SC1, except a ground-only attack. Fairly good at rushing bunkers, but die pretty fast to siege tanks.

Really loved the design though - fit the zerg ethos really well.)

I'm not sure bursting in with an "EXCEPT!" is quite the correct response to a satirical bible quote.

How do PC sales get tracked when a lot of builds are put together from parts at mom-and-pop stores or by the customers themselves?

There's only two vendors selling CPUs for build your own PCs. Track their sales, and there's your PC numbers.

Lol this made my morning

I read that in the voice of Cecil B. DeMille narrating in The Ten Commandments. Well done.

Near consensus on x86? But aren’t most computers in this world actually smartphones?

I'm not sure what you're saying, but I appreciate that you put more effort into saying it than ten other HN posts combined.

It's an adapted Bible quote. Genesis 11:6, about the tower of Babel.

Cool, thanks for the explanation. It sounded Biblical, but I wasn't sure.

I think it was an allusion to the tower of Babel.

> A world with 2 architectures for mainstream use cases is the future

And the past, I mean PowerPC was a thing for a long time on both Apple desktop systems and servers.

Glad that the compiler toolchains make this transition a lot easier, and Apple has been a major contributor to that. It helped ease the transition from 32 bit ARM to 64 bit ARM, it enabled easy cross-platform apps (Mac Catalyst) and now from x86 to ARM for desktop apps.

And of course it's not particularly new; 25 years ago (ish) Java came out that promised the same thing, one codebase that runs on all architectures. Scripting languages, too.

PowerPC was also the dominant embedded platform for two decades (the rover that just landed on Mars is PPC). Point being, ARM's presence in highly vertical markets like embedded or phones means very little to desktops and servers outside of Apple.

Only 2 architectures? Hah! I remember when there was X86, Sparc, MIPS, PowerPC, M68K, and Alpha, all in relatively common use. There were a few Itanium, S390x, and other weird things floating around too. (MIPS, PowerPC, and S390x are still hanging around in niche applications today.)

Portability is not hard. If you write standard C/C++ that does not depend on undefined behavior (like wild-ass pointer casts, etc.) you will be fine 99% of the time. Use newer languages like Go and Rust or higher-level languages and you won't even notice.

The only hard areas where labor intensive porting is needed are hand rolled ASM or the use of CPU-specific extensions like vector code (e.g. __m128i and friends). That's a tiny fraction of code written and is generally confined to things like codecs, graphics engines, crypto, and math kernels.

The problem with C/C++ and "newer" languages is that programs need to be individually compiled or have an interpreter installed to execute them - the main problem the author is solving.

The author realises precisely that there used to be multiple architectures, just like you, but also notices that we have converged on x86-64, which she terms the lingua franca.

I also completely follow her sentiment that we should not switch ISA unless there is a very real computation-per-unit-of-power benefit to doing so.

> Portability is not hard. If you write standard C/C++ that does not depend on undefined behavior (like wild-ass pointer casts, etc.) you will be fine 99% of the time.

This is not my area of expertise, so I'm not the one to write the rebuttal, but it seems that "---- is not hard" is never anything more than an invitation to someone who fully understands ---- to explain why it is hard.

Succinctly, serving 99% of the use cases with no effort mainly seems to be a recipe for making sure that, when one hits those 1% problems, one has no idea how to deal with them. I suspect that portability is one of those things where it's easy to do a mediocre job but hard to do a good/robust job.

Does ARM really have a performance advantage? Or is it the specific Apple customizations tailored to their use case?

Apple doesn't have to worry about 35 years of legacy architecture to support.

> Or is it the specific Apple customizations tailored to their use case?

Apple's use case is "run applications". It's not like there's any magic or they have some sort of ultra specific workload they improved by 10x while the rest sat there.

Apple's customisations are largely "throw hardware at the problem", which I'm reasonably sure Intel would do if that worked for x86. So sounds like something you can do with ARM, which you can't with x86.

The more magical customisations are workload specific, but then they would only trigger for these workloads, both of which are pretty much opt-in: running emulated x64 code on ARM, and performing matrix computations (which AFAIK will only be used through the Accelerate framework).

As far as I understood, some of the reasons M1 is fast are in fact specific to ARM. For instance, the advantages given by the width of the decode depend partly on the uniformity of ARM instruction size, and M1 also benefits from looser ordering of memory operations.

Intel would do that if they could shrink their transistors. But because they are still at 14nm they are heavily constrained. It's actually amazing they are competitive at all given they are now 3 generations behind in manufacturing.

> So sounds like something you can do with ARM, which you can't with x86.

There's no reason why Intel couldn't, but they don't have the incentive to hyper-optimize frequently used Apple workloads like Final Cut Pro.

> There's no reason why Intel couldn't

If Intel could, they would. For years now they’ve been spending billions to get fraction-of-a-percent improvements on benchmarks. You really think if they could increase die size by 10% and get 30% better perf they’d say no? Come on.

> they don't have the incentive to hyper-optimize frequently used Apple workloads like Final Cut Pro.

Except M1’s performance improvements show up across the board including software which has no relation to Apple, so this is just complete nonsense.

Performance is agnostic of ISA. Apple's custom designed cores do indeed have a massive performance/Watt advantage over x86 based designs and happen to be using ARM. However, it's not impossible for an x86 CPU to be designed in a similar way. It does, however, get more difficult to do so due to x86's variable-length instruction encoding, which ARM does not have.

x86’s instruction decoder suffers from its inability to parallelize some things. Because instructions have no fixed boundary,[a] something has to process the bytes sequentially. Even if they can be read from memory in massive amounts, something still has to sit there going byte by byte to find the boundaries.

The good news is, once those boundaries are found, uops can be generated. But that ~5% or so of die space is always running full tilt (provided there’s no pipeline stalls).

I’m sure Intel and AMD have put a massive amount of work into theirs to make it as quick as possible,[b] but it’s still ultimately a sequential operation.

With RISC-like architectures like ARM and RISC-V, you don’t need that boundary detector. Just feed the 2 or 4 bytes straight into the decoders.

[a]: Unlike ARM and RISC-V which have fixed 2 or 4 byte encodings (depending on processor mode), x86’s instructions can be anywhere from 1 through 15 bytes.

[b]: Take the EVEX prefix for example. It is always 4 bytes long with the first one being 0x62. So, once you see that 0x62 byte after the optional “legacy prefixes”, you can skip 3 bytes and go to the opcode. But then you need to decode that opcode to see if it has a ModR/M byte, decode that (partially) to see if there’s an SIB byte, decode that to see if there’s a displacement (of 1, 2, or 4 bytes), etc. And then, don’t forget about the immediate (which can be 1, 2, 4, or (in one case of MOV) 8 bytes).
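To make the sequential-versus-parallel point concrete, here is a minimal sketch. It uses a made-up toy encoding (not real x86 or ARM): in the variable-length stream the first byte of each instruction is assumed to hold its total length, so the start of the next instruction is unknown until the current one has been examined. In the fixed-width stream every start offset can be computed independently.

```python
def fixed_width_boundaries(code: bytes, width: int = 4) -> list[int]:
    """Every start offset is known up front, so each slot could be decoded in parallel."""
    return list(range(0, len(code), width))


def variable_length_boundaries(code: bytes) -> list[int]:
    """Toy encoding (hypothetical): the first byte of each instruction is its length.

    The loop is inherently sequential: the next offset depends on the
    length found at the current offset.
    """
    boundaries = []
    offset = 0
    while offset < len(code):
        length = code[offset]
        if length == 0 or offset + length > len(code):
            raise ValueError(f"bad length byte at offset {offset}")
        boundaries.append(offset)
        offset += length
    return boundaries


fixed = bytes(12)  # three 4-byte instructions
variable = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC, 5, 1, 2, 3, 4])  # lengths 1, 3, 2, 5

print(fixed_width_boundaries(fixed))         # [0, 4, 8]
print(variable_length_boundaries(variable))  # [0, 1, 4, 6]
```

Real x86 length decoding is far messier than a single length byte, as footnote [b] describes, and real hardware uses speculative tricks this sketch ignores.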

Something has been bugging me about x86’s lack of boundaries...could the boundaries be computed ahead-of-time and passed to the processor?

Not that I’m aware of. The decoding of an instruction is complicated and also dependent on the current operating mode and a few other things. So, for an OS to pass those lengths beforehand, it’d have to know everything about the current state of the processor at that instruction. For example, in 16 and 32 bit modes, opcodes 0x40 through 0x4F are single byte INC and DEC (one for each register). In 64 bit mode, those are the single byte REX prefixes; the actual opcode follows. See also: the halting problem.
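A small sketch of that 0x40 through 0x4F example, showing how the same byte means different things depending on the mode. The function name and output strings are made up for illustration, and the 32-bit register names assume the default operand size in 32-bit mode.

```python
# Register numbering used by the one-byte INC/DEC encodings.
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def classify_40_4f(byte: int, long_mode: bool) -> str:
    """Describe what a byte in 0x40..0x4F means in the given mode (illustrative only)."""
    if not 0x40 <= byte <= 0x4F:
        raise ValueError("byte must be in 0x40..0x4F")
    if long_mode:
        # In 64-bit mode the byte is a REX prefix with layout 0100WRXB.
        w, r, x, b = (byte >> 3) & 1, (byte >> 2) & 1, (byte >> 1) & 1, byte & 1
        return f"REX prefix (W={w} R={r} X={x} B={b})"
    # Outside 64-bit mode: 0x40+r is INC, 0x48+r is DEC.
    op = "inc" if byte < 0x48 else "dec"
    return f"{op} {REGS32[byte & 7]}"


print(classify_40_4f(0x41, long_mode=False))  # inc ecx
print(classify_40_4f(0x41, long_mode=True))   # REX prefix (W=0 R=0 X=0 B=1)
```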

As for why it became an issue, instruction sets need to be designed from the beginning to be forward expandable. Intel has historically not done that with x86. Take the vector extensions for example. Originally, with SSE, it was just 128 bit (XMM) vectors encoded as an opcode with various prefix bytes being used in ways they weren’t intended. Later, with AVX, 256 bit vectors were needed. So they made the VEX prefix. But it only had 1 bit for vector length. This allowed 128 bit (XMM) and 256 bit (YMM) vectors, but nothing else. So when AVX-512 came along, Intel had to ditch it and create the EVEX prefix and allow both to be used. But EVEX only has 2 bits for vector length. So, should something past AVX-512 come out (AVX-768 or AVX-1024?), it’ll probably use the reserved bit pattern 11, and they’ll be stuck again if they want to go past that.
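The vector-length squeeze can be written down as a lookup, as a rough sketch. The table and function names are made up for illustration, and this ignores cases where the EVEX length bits are reused for other purposes.

```python
# Vector length in bits selected by the length field of each prefix.
VEX_L = {0b0: 128, 0b1: 256}                  # 1 bit: both patterns are taken
EVEX_LL = {0b00: 128, 0b01: 256, 0b10: 512}   # 2 bits: 0b11 is reserved


def vector_bits(prefix: str, field: int):
    """Return the vector width for a length field, or None if the pattern is reserved."""
    table = {"vex": VEX_L, "evex": EVEX_LL}[prefix]
    return table.get(field)


def free_patterns(prefix: str) -> int:
    """Count how many length encodings are still unassigned."""
    bits = {"vex": 1, "evex": 2}[prefix]
    table = {"vex": VEX_L, "evex": EVEX_LL}[prefix]
    return 2 ** bits - len(table)


print(vector_bits("evex", 0b10))  # 512
print(free_patterns("vex"))       # 0
print(free_patterns("evex"))      # 1
```

With one pattern left, EVEX has room for exactly one more vector width before another new prefix would be needed.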

For an example of this being done right, ForwardCom[0] (started by the great Agner Fog) took the “forward compatibility” (hence the name) issue into mind and used 2 bits to signal the instruction length. It’ll probably never reach silicon, but it and RISC-V (which is in silicon form) are good examples of attempting to keep things forward compatible.

[0]: https://forwardcom.info/

> Not that I’m aware of. The decoding of an instruction is complicated and also dependent on the current operating mode and a few other things. So, for an OS to pass those lengths beforehand, it’d have to know everything about the current state of the processor at that instruction

The compiler would know the instruction boundaries. It could store that information in a read-only section in the executable. The OS would then just pass that section to the CPU somehow.

I don't think there is anything impossible about this. Would there be sufficient performance benefit to justify the added complexity? I don't know, quite possibly not.

This sounds like a potential attack vector.

I'm not sure why it would be. If the boundary information were wrong, the CPU instruction decode would fail, but that should just be an invalid instruction exception, which operating systems already know how to handle.

"Performance is agnostic of ISA" is too strong a statement. The variable length instruction encoding is a significant performance disadvantage, as is the strict memory ordering requirement of X86/X64.

X64 decoders are indeed only ~5% of the die on a modern CPU, but it's 5% that is always at 100% utilization. That's a non-trivial amount of extra power. X64 decode parallelism is also limited. I've heard four instructions at once as a magic number beyond which it becomes really hard. This is why hyperthreading (SMT) is so common on X64 chips. It's a "cheat" to keep the pipeline full by decoding two different streams in parallel (allowing 8X parallelism). SMT isn't free though. It drags in a lot of complexity at the register file, pipeline, and scheduler levels, and is a bit of a security minefield due to spectre-style attacks. All that complexity adds more overhead and therefore more power consumption as well as taking up die space that could be used for more cores, wider cores, more cache, etc.

ARM is just a lot easier to optimize and crank up performance than X86. The M1 apparently has 8X wide instruction decode, and with fixed length instructions it would be trivial to take it to 16X or 32X if there was benefit to that. I could definitely imagine something like a 16X wide ARM64 core at 3nm capable of achieving up to 16X instruction level parallelism as well as supporting really wide vector operations at really high throughput. Put like 16 of those on a die and we're really far beyond X64 performance in every category.

This is also why SMT/hyperthreading doesn't really exist in the ARM world. There's less to be gained from it. Better to have a simpler core and more of them.

IMHO X86/X64 has hit a performance wall at least in terms of power/performance, and this time it might be insurmountable due to variable length instructions and associated overhead. It matters in the data center as well as for mobile and laptops. There's a reason AWS is pricing to steer people toward Graviton: it costs less to run. Power is the largest component of most data center costs.

While it’s absolutely true that fixed width instructions make parallel decoding vastly easier, there’s a cost in terms of binary footprint size. x86 generally has an advantage in instruction cache and TLB performance for this reason, which can be significant depending on the workload.

Not true. This is a common myth that comes from some old Linus posts in the 32-bit Pentium 4 days and still won't die. I've done comparisons to test this. Compare sizes of modern x86-64 Linux binaries to their counterparts on AArch64. You'll find that they're extremely close.

The biggest problem is all the REX prefixes. The inefficient encoding of registers in x86-64 squandered all the advantages that x86 had.

Is true. They said:

> > x86 generally has an advantage [emphasis added, not "x86-64"]

Obviously if you take the worst of both worlds (bloated and variable-width instructions), you can squander that advantage, but the advantage is in fact real.

Is this still really relevant? I can understand that it was a problem 20 years ago, but with current processors with huge L1 caches and memory bandwidth, I am starting to think that 4 bytes (or variable 4/8 bytes) is not a bad tradeoff for density vs. superscalar.

L1 size in 1999: 32 kB

L1 size in 2021: 64 kB

The L1 size is yet another place where the x86 legacy hinders things. To avoid aliasing in a virtually indexed L1 cache (which is what you want for performance in a L1 cache, since a physically indexed cache would have to wait for the TLB lookup), the size of each way is limited to the page size, which on x86 is 4096 bytes. To get a 64 KiB L1 cache, it would have to be a 16-way cache, and increasing that too much makes the cache slower and more power-hungry. It's no wonder Apple decided to use a 16 KiB page size instead of a 4 KiB page size; a 64 KiB VIPT L1 cache with 16 KiB page size needs only 4 ways.
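The way-count arithmetic in that paragraph can be checked with a few lines. This is a minimal sketch that assumes the only constraint is that one way of a VIPT cache must be no larger than one page; the function name is made up.

```python
KiB = 1024


def min_ways_vipt(cache_bytes: int, page_bytes: int) -> int:
    """Smallest associativity for which each way fits within one page."""
    return -(-cache_bytes // page_bytes)  # ceiling division


print(min_ways_vipt(64 * KiB, 4 * KiB))   # 16
print(min_ways_vipt(64 * KiB, 16 * KiB))  # 4
```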

For the L1 instruction cache, aliasing shouldn't be a problem (since it's never written to), but this is once again another place where the x86 legacy hinders things: instead of requiring an explicit instruction to invalidate a virtual address in the instruction cache, it's implicitly invalidated when writing to that address.

Apple M1 big core cache sizes:

192KB L1I/128KB L1D

Little cores: 128KB L1I/64KB L1D

Wow. Didn't know that. That should more than compensate for a very slight increase in code size for ARM64 vs X64.

When I use M1, AWS Graviton, or even older Cavium ThunderX chips I can't help but think that X86 is on its way out. The advantage is something you can subjectively see and feel. It's obvious, especially when it comes to power consumption.

Process node has something to do with it, but it's not the whole story. I'm typing on a 10nm Ice Lake MacBook Air and while this chip is better than older 14nm Intel laptops it's still just shockingly crushed by the M1 on every metric. 10nm -> 5nm is not enough to explain that, especially since apparently Intel is more conservative with its numbering and Intel 10nm is more comparable to TSMC 7nm. So it's more like TSMC 7nm vs TSMC 5nm, which is not a large enough gap to account for what seems to be at least 1.5X better performance and 3X better power efficiency.

Some of the X86/X64 apologists remind me of old school aerospace companies dissing not only SpaceX and Blue Origin but the whole idea of reusable rockets, trying to convince us that there's little economic advantage in reusing a $100M rocket stage that consumes ~$100-200K in fuel per launch.

"That's not much of a meteorite. It's no big deal." - Dinosaurs

Que? Look at VLIW ISAs for five minutes and tell me how you've arrived at "agnostic".

Agnostic is a little strong, although it is true that M1 is extremely wide especially for a laptop chip, and wide in ways beyond the decoder which could be applied to an X86 part.

Ultimately these discussions are quite hard because AMD aren't on exactly the same density, and Intel are quite a way behind at the moment.

It currently has a performance per watt advantage because of a fundamental design difference (smaller, simpler, many cores) which works great for mobile and can be scaled up to desktop/server rather than trying to scale down x86.

It seems we are finally going back to the ecosystem of the 90's with multiple processor architectures. This was the genesis of Java, and its promise of Write Once Run Anywhere was quite appealing to many developers at the time.

Back then IBM mainframes still had a strong foothold in large corporate IT departments. Sun had a dominant position as well for most newer companies. If you wanted multiple CPUs with redundant failover and gigs of RAM, Sun was your huckleberry back in the day.

It seems like the current iteration is that modern build systems provide the “write once run anywhere” rather than virtual machines, which have their own compatibility and performance issues.

It’s trivial nowadays to write a program in Go or Rust and deploy it to whatever architecture you want, without any arcane knowledge of the build process.

> It’s trivial nowadays to write a program in Go or Rust and deploy it to whatever architecture you want

According to the Rust docs [1] and the Go Wikipedia page [2], both mainly support x86. Go added windows/arm support in 2019 and macos/arm support only recently, while Rust only has tier 1 ("guaranteed to work") support for arm-linux and x86.

Am I misreading this? It does not seem "trivial" to me for arbitrary platforms.

[1] https://doc.rust-lang.org/nightly/rustc/platform-support.htm...

[2] https://en.wikipedia.org/wiki/Go_(programming_language)#Vers...

A significant barrier to getting platforms to Tier 1 support for Rust is actual hardware to run CI on. Tier 1 is an extremely high bar for support.

I do my job at work every day on a Tier 2 ARM target, and in practice, don't notice any difference from the Tier 1 targets. YMMV of course.

Thank you, that helped me put it into perspective!

I'm not sure about Go, but Rust should work on everything LLVM can emit native code for. While ARM may not be listed as "tier 1", Rust worked on M1's on launch day because of LLVM portability.

I very definitely read that tongue-in-cheek. Her project targets everything, so long as it's using AMD64, therefore anything _not_ AMD64 is useless, as it can't run her project.

How many toolchains do not have Arm64 support? Cross-compiling is ancient and most tools predate x86 being useful.

This is a follow-up to

Show HN: Redbean – Single-file distributable web server - https://news.ycombinator.com/item?id=26271117 - Feb 2021 (141 comments)

... which is still high on the front page. Also there was a big thread last year, which is still within the dupe window:

αcτµαlly pδrταblε εxεcµταblε - https://news.ycombinator.com/item?id=24256883 - Aug 2020 (286 comments)

We downweight follow-ups because otherwise the front page gets too repetitive and repetition is mainly what we try to avoid here:



There is no way to find the other thread using the site search though

the what now?

There’s a search bar right at the bottom of HN pages.

... there's content at the bottom? O_o

Closely related : this "Show HN" of an Actually Portable Executable for a web server, published earlier today by the author : https://news.ycombinator.com/item?id=26271117

Yep, there is usually a lot of "piggybacking" on HN (this comment is not mean-spirited, just stating a fact). I made a similar comment a while back [1]


You do know that it's the same person, right? They're both links to articles on her website. There is no "piggybacking", this is literally her writing about making these things.

No, "piggybacking" refers to the posting of related material on HN after the original material becomes popular. The posting on HN is the piggybacking, not the writing of the material itself.

What? This link was submitted by someone who found the article, liked it, and posted it here in the assumption others would like it too. That's literally the entire model of how HN works, that's not "piggybacking", that's exactly what HN wants you to do for it to serve the interesting articles on the web to its user base.

Using completely wrong greek letters in the title is making me very uneasy for no good reason

I used to feel the same way, and this is indeed an annoying practice. Yet here it makes perfect sense, since this work is based on using certain symbols (e.g., header magic numbers in one executable file format) according to a non-intended interpretation based on casual and meaningless similarities (e.g., as machine instructions in another executable file format).

It doesn't make perfect sense. If the name is actually "Actually Portable Executable" then all users should be able to read it that way.

If it is only stylized as "αcτµαlly pδrταblε εxεcµταblε" a readable name should be available alongside the visual styling (using ARIA attributes, for example).

Mouse over the title on the webpage

Ok, it shows a tooltip.

The `title` attribute should specifically not be used here; ARIA labels should. The title attribute is implemented differently across browsers and assistive technologies [1] and is supposed to be a title, not a duplicate of the content of the element.


But the metaphor doesn't work. I understand the intent, but if anything, it's more appropriate for an error-correcting code.


The HTML is accessible:

    <h1 title="Actually Portable Executable">αcτµαlly ...

No, that should be an ARIA label. Specifically NOT a `title` attribute.

A title provides additional (not redundant) info and browsers and assistive technologies implement the attribute differently.

The whole page is written in beautiful HTML also (probably by hand?)

    <p style="float:right">

Sure looks handwritten to me. I guess "beautiful" is subjective; the <center> element being deprecated is not.

ahhh fair!!! Unfortunately only the page title. Elsewhere in the document it's not annotated

    the <a href="https://raw.githubusercontent.com/jart/cosmopolitan/667ab245fe0326972b7da52a95da97125d61c8cf/ape/ape.S">αcτµαlly pδrταblε εxεcµταblε</a> format

still obnoxious though imho

Why, do the people with the screen reader have some specific need to read the title of this article? As if it's some important resource or something?

It's just one irrelevant thing they can't read, same as millions of articles written in different languages...

Because the title tells you what the whole page is about. You can know when to continue or not sometimes with just the title.

Unfortunately screen readers will read it out as

"which implements the alpha-see-tau-micro-alpha-lly P-delta-alpha-R-tau-A-B-L-epsilon..." etc.

Contrast that with a foreign language document, which we would expect to be either read out in the foreign language, or mechanically translated and then read.

Users with or without screen readers could reasonably expect to read plain text.
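As a rough illustration of why this happens, the stylized letters are distinct Unicode characters with their own names, and that is roughly what software sees. The sketch below uses Python's standard `unicodedata` module; the `spell_out` helper is made up, and what a given screen reader actually announces will vary.

```python
import unicodedata


def spell_out(text: str) -> list[str]:
    """Return the Unicode name of each non-ASCII character, and ASCII characters as-is."""
    return [ch if ch.isascii() else unicodedata.name(ch, "UNKNOWN") for ch in text]


# U+03B1 and U+03C4 are the Greek letters used in place of "a" and "t".
print(spell_out("\u03b1c\u03c4"))
# ['GREEK SMALL LETTER ALPHA', 'c', 'GREEK SMALL LETTER TAU']
```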

Now you reminded me of a book that my grand father has, where the title is something like:

Яцssiди Сдяs

And it’s just so horrible to make a title using Cyrillic characters according to what looks like Latin and not according to their actual sounds XD

> I chose the name because I like the idea of having the freedom to write software without restrictions that transcends traditional boundaries.

Actmallu pdrtable execmtable

Sorry to ruin the joke, but y is actually in Latin script.

Yeah, that's why I tried to pick the closest Greek has to a y sound, which I think (I only know a bit of GCSE ancient greek from 30 years ago) is upsilon. If I'd read it as the letter that looks most like a y it would have been "actmallg" (gamma).

As someone who reads and speaks (μόνο ενά λίγο - only a little) Greek, same.

There's a subreddit for parodying the phenomenon - http://reddit.com/r/grssk

Since I think this is a study in compatibility and not meant to be a paradigm-changing new programming concept, I think the author is having fun with all of it rather than being overly serious, and isn't concerned about being bookish.

I mean, I keep reading it as the letters themselves; really the only two that bothered me were delta, which is d and not o, and mu, which is m and not u.

I read your comment thinking you were being stodgy, but then I went to the site and had the same reaction.

Can't imagine that doing so is going over well for people using screen-readers.

> One of the reasons why I love working with a lot of these old unsexy technologies, is that I want any software work I'm involved in to stand the test of time with minimal toil.

Could've written a win32 program.

Could it disguise as a WinRT program?

It doesn't need to.

Unless it wants to pass the built in anti malware filter.

What’s up with your computer where Defender flags any non-RT programs? I don’t have that issue.

This got my upvote at "zip source file embedding could be a more socially conscious way of wasting resources in order to gain appeal with the non-classical software consumer".

"The most compelling use case for making x86-64-linux-gnu as tiny as possible, with the availability of full emulation, is that it enables normal simple native programs to run everywhere including web browsers by default....I think we need compatibility glue that just runs programs, ignores the systems, and treats x86_64-linux-gnu as a canonical software encoding."


Just a smiley, no other words!

Very interesting, but every one of these executables I try on my fairly stock Ubuntu system returns 'run-detectors: unable to find an interpreter'.

I'm invoking them with 'bash -c'.

Author here. That error means you're using binfmt_misc. You can fix that by saying:

    sudo sh -c "echo ':APE:M::MZqFpD::/bin/sh:' >/proc/sys/fs/binfmt_misc/register"

Then you're good to go!
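For readers wondering what that registration string means: binfmt_misc rules take the form `:name:type:offset:magic:mask:interpreter:flags`, where the first character is the field separator. The sketch below is a simplified reading of that format and not the kernel's actual parser. The class and function names are made up, and it ignores escape sequences and the mask, which is empty in this rule.

```python
from typing import NamedTuple


class BinfmtRule(NamedTuple):
    name: str
    type: str
    offset: int
    magic: bytes
    mask: bytes
    interpreter: str
    flags: str


def parse_rule(line: str) -> BinfmtRule:
    """Split a registration string on its leading delimiter character."""
    delim = line[0]
    fields = line[1:].split(delim)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, type_, offset, magic, mask, interpreter, flags = fields
    return BinfmtRule(name, type_, int(offset or 0),
                      magic.encode("latin-1"), mask.encode("latin-1"),
                      interpreter, flags)


def matches(rule: BinfmtRule, header: bytes) -> bool:
    """Simplified magic match ('M' type only): compare bytes at the rule's offset."""
    if rule.type != "M":
        raise ValueError("only magic-number rules are handled in this sketch")
    return header[rule.offset:rule.offset + len(rule.magic)] == rule.magic


rule = parse_rule(":APE:M::MZqFpD::/bin/sh:")
print(rule.name, rule.magic, rule.interpreter)  # APE b'MZqFpD' /bin/sh
print(matches(rule, b"MZqFpD='\n"))             # True
print(matches(rule, b"\x7fELF\x02\x01\x01"))    # False
```

Read this way, the rule says that files starting with the bytes `MZqFpD` should be handed to `/bin/sh`.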


Can someone explain the advantage over building an executable for each target system?

Not having to build an executable for each target system.

I think it’s neat how this acknowledges the reality that the actual meat of the machine code is identical for every x86_64 target; all that's different is the OS interface. So unlike other "fat binary" schemes where there's a lot of duplication, this one has a single main program and then small shims to provide the Linux ABI on macOS and Windows.

I don't think it has any benefit if you're installing software exclusively that you built yourself on your own targets, or from a distro package manager. But it's potentially a boon for a whole class of statically-linked rescue tools, installers, command-line utilities — basically anything where there's a website with a curl path/to/thing > local/bin/thing installation option.

It also makes manually downloaded software distribution easier. Rather than the user having to select which version of the software to download (which users often get wrong), or trying to guess based on browser user-agent, there's just the one download link that works on everything.

And malware. I don't know why that popped into my mind as the first use-case for this and the web server. :|

This is slightly faster I guess?

I don't think it's that big of a deal either, since compiling these days is fast enough you can do it 3 times without it being a problem.

Don't get me wrong, it's very impressive, I just don't think it makes that big of a difference in practice, especially since environmental differences will still require you to have 2 codebases in many scenarios (like accessing the filesystem for example)

Every source-portable program has that anyway though, typically either with a bunch of ifdefs, or by linking to an abstraction like boost::filesystem.

The change here would basically be that all versions of it would have to be compiled into the same binary, with a runtime switch.

Nope much slower. And never worked for me.

It's cool.

It is very satisfying.

I think heterogeneous computing is actually coming this time. Increasing binary size requirements wherever portability is required will be an intended sacrifice toward that aim (but for app store based distributions - the only place required to pay the binary size tax will be in the size of the bundle provided to the app store vendor).

I think the importance of ISAs will fade away generally in favor of specifications that enable coordination of higher level memory model semantics "across" compute resources -- the cpu/compute core becomes the thing that allows you to share reads and transfer write ownership of memory as efficiently as possible between heterogeneous components that operate on the compute graph -- and many of these compute components may require various binary forms of task specific instruction encoding ...

I don't understand how this provides a POSIX API on Windows.

There is a tcsetattr function in Cosmopolitan.

If I use that to obtain character-at-a-time input with no echo, and then run the portable executable on Windows, will that have the right effect in the console window?
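
For reference, the POSIX side of that question looks roughly like the sketch below. It uses only the standard termios calls; the function names `make_char_at_a_time` and `enable_char_at_a_time` are made up for illustration and are not part of Cosmopolitan, and how closely the Windows console behavior matches is exactly what is being asked here.

```c
#include <termios.h>

/* Edit a settings struct in place: no line buffering, no echo.
   Kept separate from the I/O so the flag changes are easy to inspect. */
void make_char_at_a_time(struct termios *settings) {
  settings->c_lflag &= ~(tcflag_t)(ICANON | ECHO);
  settings->c_cc[VMIN] = 1;   /* read() returns once one byte is available */
  settings->c_cc[VTIME] = 0;  /* no inter-byte timeout */
}

/* Apply the mode to fd, saving the previous settings in *saved so the
   caller can restore them later with tcsetattr(fd, TCSAFLUSH, saved).
   Returns 0 on success, -1 on failure (for example, fd is not a terminal). */
int enable_char_at_a_time(int fd, struct termios *saved) {
  struct termios raw;
  if (tcgetattr(fd, saved) == -1) return -1;
  raw = *saved;
  make_char_at_a_time(&raw);
  return tcsetattr(fd, TCSAFLUSH, &raw);
}
```

A caller would typically run `enable_char_at_a_time(0, &saved)` at startup and restore `saved` before exiting, so the shell does not inherit a terminal with echo turned off.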

Yup. See https://justine.lol/blinkenlights/blinkenlights-windows.png and https://github.com/jart/cosmopolitan/blob/fcfe7c108083962a3d... and https://github.com/jart/cosmopolitan/blob/fcfe7c108083962a3d... and https://github.com/jart/cosmopolitan/blob/fcfe7c108083962a3d... and https://github.com/jart/cosmopolitan/blob/fcfe7c108083962a3d... It's not perfect though. For example, right now Cosmopolitan won't do the really dirty hacks that Cygwin does to fully simulate POSIX like creating a virtual filesystem or spawning daemons and threads for handling signals. Cosmopolitan does however give you 90% the value at 10% the price. This has been great for my use case, since I'm mostly only concerned with greenfield development. I wanted the feel of POSIX but I didn't need to check off a box with a regulatory body that it conforms perfectly to POSIX. I'm also not trying to create a distro that leverages all the open source works written to date; Cygwin and MinGW are already doing a great job at that and I view Cosmopolitan as complementary.

I needed a way to port programs to Windows with accurate POSIX, like Cygwin, but without the Cygwin paths and virtual file system and other user-visible quirks.

In under twenty or so fairly simple commits to a fork of the Cygwin DLL, I got it:


This is another useful tool in this general arsenal.

A clever solution but still dependent on qemu.

I think it only uses qemu if you attempt to execute on non-x86 architectures. So it’s not a build-time dependency.

Right. I believe the long term vision is to JIT for other architectures.

Actually, I wonder why not every OS comes with a POSIX shell and a Python interpreter nowadays. POSIX shells should be super easy, because most systems have them onboard already anyways. However, since POSIX shells are kinda broken, I think Python should be the next iteration.

Just to give some context, I am not a Python person, as I prefer Go. But given the popularity and the suited use-cases I think it is a good option.

The problem is bash et al are languages designed for the command line. Every line is a separate command.


    cat test.txt | grep search

    import os
    import subprocess
    with open('test.txt', 'r') as f:
        for line in f:
            line = line.rstrip()
            subprocess.call(['/bin/grep', line, 'search'])
While the first may use some “magic” symbols such as the pipe, it’s really concise in conveying what it’s doing.

I will give you this: bash variables and expansions can be confusing. Contrast with programming where this probably works:

    "start" + variable + "end"
[0]: https://stackoverflow.com/a/9018183/1350209

My problem with shells is not just about the obscure syntax, but rather about the point, that it is next to impossible to write reliable and reusable scripts.

By default return codes of failed commands are silently ignored and `set -e` does not work under all circumstances. By default every variable is part of the global scope and the best you can do about it in a POSIX compliant way are sub-shells, which in turn have no way to change variables outside of their scope.

It is just broken by design :-/

It's because shells are really designed as ways to run and manage tasks (subprocesses). So anything not related to that is at best secondary, and a pain in the ass.

The opposite being true for your average general-purpose program, where managing tasks is a secondary concern and delegated to a library.

Not only is the intent of those snippets rather different, you've rather misunderstood the original to the extent that it's broken.

The bash version looks for the pattern `search` in every line of `test.txt`, the Python version treats `test.txt` as a file of patterns, and looks for each of these patterns in the file `search`.

And of course you wouldn't implement the bash version in python that way as it's rather trivial to do it in Python:

    out = [line for line in open('test.txt') if 'search' in line]
or somesuch.

As somewhat of a Python developer these days, I have to point out that each 3.something release of Python potentially breaks backwards compatibility.

Second issue is that Python without any extra modules is still pretty limited, nearly every serious python project comes with some extra dependencies.

So having a Python interpreter available only gets you so far...

(For the record, the same could be said about pretty much all dynamic languages I'm familiar with).

Maybe a POSIX shell, as the POSIX shell standard is small and for most purposes fixed. But you don't want to use the OS python, as it is inevitably old and outdated.

> αcτµαlly pδrταblε εxεcµταblε

As a Greek, if you do this, I hate you. Why the hell do you have to make me read "actmally pdrtable execmtable"? At least this is one of the less offensive cases.

EDIT: Solidarity to our Cyrillic friends!

Author here. I wanted to honor Greece for the amazing cultural impact they've had, similar to how mathematics honors Greece. We got a lot of comments like this in the last thread. What dang said about it was really smart: https://news.ycombinator.com/item?id=24264514

What dang said about it was not smart.

> it's good for readers to have to work a little

Unless they're using assistive technologies. In that case it's a nightmare. Don't make your users work.

> it's not hard for any HN reader to do the bit of work to figure it out

Unless they're using assistive technologies. Or just want to read it without work.

Or, say searching for it. This post comes up. The one you linked to doesn't.


Respect to you for wanting to honor Greece. I think using the letters* correctly would honor them more. (thanks for the correction)

> I think using Cyrillic correctly would honor them more.

(Greece doesn't use Cyrillic but I agree with you otherwise)

Yes. My screen reader, at least Voiceover on my phone, had a stroke reading that. I had to navigate letter by letter and guess what it meant. But it's also quite common so I'm used to doing that regardless.

Ah, I don't want to make a fuss about it (my comment was tongue-in-cheek), it's really not a big deal, but it is annoying to spend 2-3 seconds trying to figure out if you're having a stroke, and then some more trying to suss out what the sentence is actually trying to say.

If you want to honor Greece, use the letters as they're meant to be used! "Acτuaλλy πoρτabλe εxecuταbλe" would be much better (though I've intentionally tried to give English readers a stroke with this one :)!

The entire project is built on not using things the way they are meant to be used, though. The name is kind of doing the exact same thing the code is.

Though, oddly enough, the English letters are used exactly how they're meant to be used :P

So the code is "valid" on all platforms, but actually crashes on the "Greek" one?

It's actually not smart at all. Replacing the letters in the Roman alphabet with Greek letters based on superficial resemblance is not any different from replacing the "R" with "Я" when writing about anything Russian-related (you see it stupidly used in book covers, t-shirts, etc).

How does this do anything to honor the cultural legacy of Greece? Perhaps we could honor the legacy of 19th century mathematics by using Fraktur characters when they resemble Latin ones?

When people who can read Greek are telling you it's bad taste maybe take their word for it! Not dang.

What you're really saying is that the Greek alphabet (and by extension its language community) is so insignificant compared to Latin that the cost of potential misrecognition is so low that it can be disregarded. This is chauvinism, not "honoring Greek mathematics"!

Word. I'm still trying to find out who Doidld Tyatsmr is and why is he so hated in the US.

I know this is a loaded question, but are there any resources you can point to in learning the linux syscall stuff, or perhaps writing a C compiler from scratch? I thought I had a fairly good grasp of this stuff but after looking through cosmopolitan code, I realized I'm not even close.

Rui is writing a book for the chibicc compiler in the cosmo codebase. I should probably write a book on system interfaces since there's no school for it. I had to go straight to the primary materials, i.e. the source to pretty much every existing kernel and libc along with the historical ones in order to understand the origin of influence. That's what helped me have a razor sharp focus on the commonalities which made this project possible.

So I'd say that the SVR4 source code would be a good place for you to start. It's like ambrosia and once you've read it you can always tell by reading modern code which developers have and haven't seen it. There's also the Lions' Commentary on Unix. I highly recommend Richard W. Stevens. The last book on the required reading list is BOFH.

> learning the linux syscall stuff

I've been studying this for a while. Turns out Linux has an amazing interface. It's stable and language-agnostic. All you need to do is put the values in specific registers and execute a special instruction. The result comes back in one of those same registers.

The high level documentation is here:





On Windows there is a similar interface but it is not stable. The system call numbers can change. Developers are supposed to use the good old Microsoft DLLs in order to get anything done. Just like how everyone uses libc on other systems.

Linux is different. The system call binary interface is the Linux interface. So it's actually possible to trash all of GNU and rewrite the entire Linux user space in Rust or Lisp or whatever. It doesn't have to be written in C. It doesn't even have to be POSIX compliant. Could be GUI-focused!

All you need to make any x86_64 Linux system call is this code:

  long
  system_call(long number, long _1, long _2, long _3, long _4, long _5, long _6)
  {
      register long rax __asm__("rax") = number;
      register long rdi __asm__("rdi") = _1;
      register long rsi __asm__("rsi") = _2;
      register long rdx __asm__("rdx") = _3;
      register long r10 __asm__("r10") = _4;
      register long r8  __asm__("r8")  = _5;
      register long r9  __asm__("r9")  = _6;

      /* r8, r9 and r10 may be clobbered but can't be in the clobbers list
         because the compiler won't use clobbered registers as inputs.
         So they're placed in the outputs list instead. */
      __asm__ volatile
      ("syscall"

       : "+r" (rax),
         "+r" (r8), "+r" (r9), "+r" (r10)
       : "r" (rdi), "r" (rsi), "r" (rdx)
       : "rcx", "r11", "cc", "memory");

      return rax;
  }
This is all you need to do anything. You can perform I/O. You can allocate memory. You can obtain your terminal's dimensions. You can perform ioctl's to your laptop's camera. You could make a new programming language today and all it really needs to be complete is this single function. What if instead of having this function the compiler could simply emit code that conforms to this binary interface? The language could have a system_call keyword that generates Linux system call code!
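
To make that concrete, here is a minimal sketch of two wrappers built on a raw system call, assuming x86_64 Linux. The names `raw_syscall3`, `raw_write` and `raw_getpid` are invented for this example; the numbers 1 and 39 are the x86_64 values for write and getpid and differ on other architectures.

```c
#include <stddef.h>

/* Three-argument raw system call for x86_64 Linux.
   rax carries the number in and the result out; rdi, rsi and rdx carry
   the arguments; the syscall instruction itself clobbers rcx and r11. */
long raw_syscall3(long number, long a, long b, long c) {
  long result;
  __asm__ volatile("syscall"
                   : "=a"(result)
                   : "0"(number), "D"(a), "S"(b), "d"(c)
                   : "rcx", "r11", "cc", "memory");
  return result;
}

#define NR_WRITE 1   /* x86_64 system call numbers */
#define NR_GETPID 39

/* Returns the byte count written, or a negative errno value on failure. */
long raw_write(int fd, const void *buffer, size_t length) {
  return raw_syscall3(NR_WRITE, fd, (long)buffer, (long)length);
}

/* getpid takes no arguments, so the three argument slots are ignored. */
long raw_getpid(void) {
  return raw_syscall3(NR_GETPID, 0, 0, 0);
}
```

Calling `raw_write(1, "hello\n", 6)` prints to standard output without going through any libc I/O function.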

Once I realized this I tried to turn it into a library called liblinux... I stopped working on it when I found out the kernel already has an awesome single file header you can include that lets you build freestanding Linux executables for a ton of architectures. They use it on the kernel to build their own tools!


It even includes process entry point code! Linux copies the argument and environment vectors to the stack before entering the executable. The process start up code obtains those pointers and passes them to the main function. It also ensures the exit system call is called.

The process entry point is usually called _start because that's what linkers look for by default. In reality the ELF header has a pointer to the program's entry point, the actual symbol doesn't matter. You can tell the linker to set it to any other address or symbol. Also note that it's an entry point, not a function. There is no return address. Allowing that code to terminate results in a segmentation violation. Hence the need to ensure exit is called before that happens.

The only feature that seems to be missing is support for the table of auxiliary values:


The auxiliary values are placed on the stack immediately after the environment vector. So all you need to do to find this pointer is loop through it until it goes out of bounds. I wrote this code and it works:


  #include <elf.h>  /* Elf64_Off */

  struct auxiliary { Elf64_Off type; Elf64_Off value; };

  /* Skip past the null terminator of a pointer vector. */
  static void *after(void *vector)
  {
      void **pointer = (void **) vector;
      while (*pointer++ != 0);
      return pointer;
  }

  int liblinux_start(void *stack_pointer)
  {
      long count;
      char **arguments;
      char **environment;
      struct auxiliary *values;

      count = *((long *) stack_pointer);
      arguments = ((char **) stack_pointer) + 1;
      environment = arguments + count + 1;
      values = after(environment);

      return start(count, arguments, environment, values);
  }
You can just loop over the pointer to the structure until you find one with type equal to AT_NULL. Example here:
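
As a rough, independent sketch of that loop (not the example this comment linked to), the function below walks the auxiliary vector until AT_NULL and is cross-checked against glibc's `getauxval`. It assumes 64-bit Linux and that `environ` still points at the vector the kernel put on the stack, which holds as long as nothing has called `setenv` or `putenv`. The names `find_auxiliary_value` and `page_size_matches_getauxval` are invented for illustration.

```c
#include <elf.h>
#include <stddef.h>
#include <sys/auxv.h>

extern char **environ;

/* Find an auxiliary value by walking past the environment vector.
   Returns the value for `type`, or 0 if the type is not present. */
unsigned long find_auxiliary_value(char **environment, unsigned long type) {
  char **cursor = environment;
  while (*cursor != NULL) cursor++;  /* skip the environment strings */
  cursor++;                          /* step over the null terminator */

  /* The auxiliary vector starts here and ends at an AT_NULL entry. */
  for (const Elf64_auxv_t *entry = (const Elf64_auxv_t *)cursor;
       entry->a_type != AT_NULL; entry++) {
    if (entry->a_type == type) return entry->a_un.a_val;
  }
  return 0;
}

/* Cross-check against glibc's getauxval(); returns 1 when both agree. */
int page_size_matches_getauxval(void) {
  return find_auxiliary_value(environ, AT_PAGESZ) == getauxval(AT_PAGESZ);
}
```

In a freestanding program there is no `getauxval` to compare against, so the manual walk is the only option; the cross-check is here only to show the two agree.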


Author here. You would like this project: https://chromium.googlesource.com/linux-syscall-support/ Thank you for reminding me of the joy I felt when I discovered this. I feel like you should publish this and post it on Hacker News. Because too many people who post here hold the viewpoint that SYSCALL is evil and you must link the platform libc dynamic shared object or else you're a very horrible person who deserves to have their binaries broken like Apple did to Go. But they wouldn't feel that way, if they could just see the beauty you described.

Thanks! Your projects are so inspiring. I too felt great joy discovering all this. Every time I see someone asking about system calls I respond by writing about everything I know. I usually don't get many replies... So happy to see another person who understands.

> You would like this project: https://chromium.googlesource.com/linux-syscall-support/

Yes, I would! I saw references to this library in your source code, specifically your jump slots implementation. I had no idea Chromium had this and I've been meaning to explore it later. I'm gonna do it now.

> Because too many people who post here hold the viewpoint that SYSCALL is evil and you must link the platform libc dynamic shared object or else you're a very horrible person who deserves to have their binaries broken like Apple did to Go.

I know what you mean! Using system calls directly is heavily discouraged by libc maintainers and even users. Using calls like clone will actually screw up the global state maintained by glibc's threads implementation. It gets to the point where they don't even offer wrappers for system calls they don't want to support. I don't like it... What's the point of an amazing system call that lets you choose exactly which resources you want to share with a child task if all it's ever used for is some POSIX threads implementation?

Even the Linux manuals do this for some reason: the documentation I linked in my above post actually describes the glibc stuff as if it was part of the kernel and leaves the actual binary interfaces as an afterthought. Linux manuals also inexplicably host documentation for systemd instead of a generic description of how a Linux init system is supposed to interface with the kernel. It makes no sense to me!

I even asked Greg Kroah-Hartman about it on Reddit:


I actually think using the system call interface is better than using the C library. No thread local errno business, no global state anywhere, no buffering unless you do it explicitly, no C standard to keep in mind... It's just so simple it's amazing. It's also stable unlike other operating systems which ship user space libraries as the actual interface. On Linux there's no reason not to use it!
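
A small sketch of what "no errno" means in practice, assuming x86_64 Linux: the kernel hands back either a result or a negative error number in the range -4095 to -1, all in the return register. The helper names `raw_close` and `is_syscall_error` are invented for this example; 3 is the x86_64 number for close.

```c
#include <errno.h>

/* Raw close(2) on x86_64 Linux: the result is returned directly,
   with no thread-local errno involved. */
long raw_close(int fd) {
  long result;
  __asm__ volatile("syscall"
                   : "=a"(result)
                   : "0"(3L), "D"((long)fd)
                   : "rcx", "r11", "cc", "memory");
  return result;
}

/* The kernel reports failure as a negated errno value in [-4095, -1]. */
int is_syscall_error(long result) {
  return result < 0 && result >= -4095;
}
```

For example, `raw_close(-1)` returns `-EBADF` directly, so the caller can branch on the value without reading any global state.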

> I feel like you should publish this and post it on Hacker News.

I wrote a liblinux library, the README describes part of my journey learning about this system call stuff. Lots of LWN sources!


I've been thinking about expanding on it in order to describe everything I know about the Linux system call interface. You really think I should publish this?

The reason I didn't post liblinux here is it's in a very incomplete state and actually less practical than the kernel nolibc.h file. I only discovered the header much later into development and figured there was no point anymore since the kernel had a much better solution not only available but in actual use. I ended up rewriting autoconf in pure makefiles instead...

We're pretty much on the same page. I'm not sure if I share your enthusiasm for clone(), but I think the canonical Linux interface is what's going to save us from the dark patterns we see in userspace. I think everyone should learn how to use raw system calls. Because the first thought that's going to cross their mind is "wow I thought my C library was doing all these things" and then they're going to want a C library that offers more value than putting a number in the eax register.

For example if you want to call fork() using asm() then on Linux it's simple:

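    int fork(void) {
      int ax;
      asm volatile("syscall" : "=a"(ax) : "0"(57) : "rcx", "r11", "memory", "cc");
      /* errors come back as -errno in [-4095, -1]; the unsigned compare
         keeps successful (non-negative) results from matching */
      if (ax > -4096u) errno = -ax, ax = -1;
      return ax;
    }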
But if you want to support XNU, FreeBSD, OpenBSD, and NetBSD too, it gets a little trickier:

    int fork(void) {
      char cf;
      int ax, dx;
      ax = IsLinux() ? 57 : 2;
      if (IsXnu()) ax |= 0x2000000;
      asm volatile("clc\n\t"
                   "syscall"
                   : "+a"(ax), "=d"(dx), "=@ccc"(cf)
                   : "1"(0)
                   : "rcx", "r11", "memory", "cc");
      if (cf) ax = -ax;
      /* unsigned compare so only -errno values in [-4095, -1] match */
      if (ax > -4096u) errno = -ax, ax = -1;
      if (ax != -1) ax &= dx - 1;
      return ax;
    }
Cosmopolitan abstracts stuff like that for you, but right now that's only if you're the kind of person who doesn't need threads. I imagine you are, since folks who do smart things with multiprocessing models like Go and Chromium usually don't want C libraries potentially stepping on their toes. Oh gosh threads. The day I figure out how to do those, will be the day the whole world will want to use this thing. But I want people who use Cosmopolitan Libc to know what value it's providing them. I think the best way to do that is by raising awareness of the systems engineering fundamentals like this. Because that's something you're right to point out that the Linux community leadership has room for improvement on.

I remember simple use cases for clone() such as spawning child processes with just enough shared resources to execve(). I remember reading a lot of old emails from Torvalds about it, can't find them anymore.

I used to value portability but now I believe in using Linux everywhere and for everything. I like OpenBSD too but Linux is the stable one you can build anything on. What I wanted to eventually accomplish is a 100% freestanding Linux user space with no libraries at all. Maybe boot straight into the program I want to use, just like we can pass init=/usr/bin/bash in the kernel command line. How far could this go? Using nothing but system calls it's actually possible to get a framebuffer and use software rendering to draw some graphics. I'm guessing pretty far.

By starting from scratch like this it's possible to fix all the historical problems with our systems. For example, I think it's unacceptable when libraries keep global state. This can't be fixed without getting rid of libc and its buffers and caches and errno. Removing this cruft would actually simplify a threads implementation. And then there's completely insane stuff like .init and .fini sections:


A similar statically-linked user space project I found years ago:


That seems kind of contradictory. Your biases are tuned towards what works for big codebases but you're taking a first principles approach. A simplified threads implementation is called Java. Plus it gives you cool classes like Phaser. Threads will always be a horror show with C/C++. There's a reason why Linux is the only kernel that implements clone(). It's controversial.

I have to admit, every previous time I saw this linked I didn't bother clicking through, because from the title I thought it was a post mocking the concept of portable executables.

I'm guessing this is the unfortunate consequence of the pattern "actually, " becoming a pejorative meme in the past year or so.

no, it's that the greek letters reminded me of the twitter "Im MoCkInG SoMeThInG sTuPid" format

I appreciate the good intentions, but confusing Greek readers doesn't seem to me like a good way to honor the cultural impact of Greece.

Aren't the Greek symbols used in math void of implicit meaning? You're taking a meaningful English sentence and replacing its letters with Greek letters, while making it extremely difficult for people with disabilities who rely on screen readers. Those two things are not the same.

> The quality of this post is so high that it doesn't feel right to override any aspect of what the author created, including quirks like the title.

I agree with dang's feelings/thoughts about the issue.

Perhaps a solution would be to add the "normal" meaning between parenthesis, after the one in greek alphabet?

By the way, Justine: great work. Besides the obvious HN recognition, I wanted to tell you explicitly as well.

What are you going to work on in the near future? Curious to hear about it. If you don't want to post in public, $my_hn_username at gmail

Deep breaths. Deep, slow breaths. You're going to be just fine.

As a Russian speaker, I LOVE these things. Both ways (Russian letters abused to spell English words and vice versa.) In fact I miss old phones with just English keyboards where abuse to spell Russian words (eg CCCP) was an art form, for a brief period.

Especially since if you want to port the title to Greek lettering, you have upsilon and omicron for u and o:

αcτυαlly pοrταblε εxεcυταblε

Omicron looks exactly the same as the English o (it's not visually distinguishable in most typefaces) so it doesn't matter much, but upsilon is an "ee" sound usually, not an "oo" like in "actually" and "executable", so it wouldn't work exactly. It would read "actially execeetable".

EDIT: For completeness, the full transliteration (or as close to it) would be "άξουαλι πόρταμπολ εξεκιούταμπολ". The extra "o" in "portabol" and "execiutabol" is actually a schwa, I think, so it can be omitted.

Upsilon is admittedly an "i" sound in Modern Greek, but in Attic Greek (which is what I studied, sorry) it did have the "oo" sound.

Edited: I missed the pi and rho completely though, my bad

Ah yes, you are correct!

Because it makes you look like you know what you're doing, not too different than obfuscating javascript for the sake of security, which does kind of work, at least on the lowest common denominator type of attacks, and this kind of works too by having people think you're more of a genius than previously thought because you can turn boring English letters into something exotic which appeals to the ignorance of the masses [0].

[0] https://en.wikipedia.org/wiki/Argument_from_ignorance

you might be overthinking this
