Yes, most programs should disallow W|X by default. But trying to banish the entire practice with a mount flag, knowing full well that few people will go that far to run a W|X application, is bad practice. I'd rather see this as another specialty chmod flag à la SUID, SGID, etc., or something along those lines. One shouldn't have to enable filesystem-wide W|X just to run one application.
The thing is, when you actually do need W|X, there is no simple workaround. Many emulators and JITs need to dynamically recompile instructions to native machine code to achieve acceptable performance (emulating a 3GHz processor is just not going to happen with an interpreter). For a particularly busy dynamic recompiler, having to constantly call mprotect to toggle pages between writable-but-not-executable and executable-but-not-writable will hurt performance too much, since each toggle is a syscall requiring a kernel-level transition.
App stores are banning the use of this technique as well. This is a very troubling trend lately; it is throwing the baby out with the bathwater.
EDIT: tj responded to me on Twitter: "the per-mountpoint idea is just an initial method; it'll be refined as time goes on. i think per-binary w^x is in the pipeline." -- that will not only resolve my concerns, but in fact would be my ideal design to balance security and performance.
When SELinux is enforcing this on Linux, you don't. The trick is to map the page twice, once for writing and once for executing. Assuming both mappings are ASLR randomized, it's actually safer than allowing mprotect.
SELinux actually has OpenBSD beat on this one: in some configurations, written pages cannot be made executable.
This is a very good point, and it seems like enabling it more granularly would be worth it.
Take Dolphin (the GameCube / Wii emulator) for example: you have a 4GiB DVD containing the entire code engine of a massive 80-hour game. You cannot recompile the entire disc at startup. Even if you could (you really can't), these games tend to push code into RAM to execute, which a static recompiler cannot handle.
The way dynamic recompilers handle the tremendous burden is to recompile small blocks at a time, and track which ones are hot and cold. When the buffers fill up, they start dropping old, stale code.
Plus, I am sure there are many more use cases for W|X than just emulators and JITs. It would be a shame to try to eradicate them all from existence.
Now in practice inline caches (ICs) usually hit (that's the point!), so once you've been running for a bit things should hopefully not need more recompiling. Which is a significant difference from the situation you describe.
The process would be:
1. Compile the code when you fault on the basic-block exit.
2. Mark that basic block executable.
3. Optionally, only after a return or other jump: mark the jumping basic block writable, patch the jump, then mark it executable again.
In this case there would be three protection changes. JS JITs are doing this a lot more often than Dolphin is: they often have more than one level of JIT in addition to an interpreter, so they will end up doing this dance more than once per basic block.
Until I see at least some microbenchmarks and concrete estimates, I don't think I'll worry too much about this. Though it is unfortunate to have to modify all of this code.
> One day far in the future upstream software developers will understand that W^X violations are a tremendously risky practice and that style of programming will be banished outright.
Whoever wrote that they should be banned outright is, I feel, being very short-sighted. It would be like banning cars because it's possible to seriously injure a person with one.
tj later replied to me on Twitter saying they meant for it to become per-process in the future. Once that happens, I'll be okay with this change. Right now, I think filesystem-level is far too broad: I will often want an emulator on the same filesystem as other applications that I do want W^X protections for.
It's different from your typical language interpreter type JIT in that it's used as an optimisation for generating an optimal processing kernel. Once generated, it's only used once before it's thrown away, as a different input requires a different processing kernel.
I have actually tested using syscalls (not for W^X, but as a potential workaround for newer Intel CPUs exhibiting weird SMC detection) and found the overhead to be far too high, even for just one syscall (whilst switching between W and X would require two).
You're right that per-binary or per-file might be better in terms of providing a fine-grained approach. On the other hand, I'd very much like to enforce this machine-wide on my workstations and servers, and per-filesystem settings seem to be the most effective way to go about that. I hope that remains an option (similar to OpenBSD's "nosuid" and "noexec" mount options).
To me that seems to make more sense than a global mount flag but I admit I'm not that knowledgeable about OpenBSD stuff. I suppose it's better late than never considering we've had this stuff in PaX since 2000.
An ELF header flag to allow W|X on a per-binary basis is in the works.
>Why not put the effort into a great MMU
The people who are writing for OpenBSD cannot "put their effort" into hardware changes to entire platforms.
They can do something to improve their own OS, however.
If it's just an interpreter, then the "sort-of executable" code is just data being read, and the page can stay writable and non-executable.
We used to have that; they're called segments.
EDIT: There's a passing mention in a book:
> The Intel i960 extended architecture processor used a tagged architecture with a bit on each memory word that marked the word as a "capability", not as an ordinary location for data or instructions. A capability controlled access to a variable-sized memory block or segment. The large number of possible tag values supported memory segments that ranged in size from 64 to 4 billion bytes, with a potential 2^256 different protection domains.
EDIT: Now that I'm on a PC I've added the link here for you. Copy and paste from PDFs is a pain on my mobile. ;)
There's some fascinating stuff in here. For example:
> 8.4 Object Lifetime
> To support the implicit deallocation of certain objects while preventing dangling references, the object lifetime concept is supported. The lifetime of an object can be local or global. "Local" objects have a lifetime that is tied to a particular program execution environment [...]. "Global" objects are not associated with a particular execution environment.
> Each job has a distinct set of local objects. No two jobs can have ADs [access descriptors] that reference the same local object. The processor does not allow an AD for a local object to be stored in a global object. Thus, when a job terminates, all the local objects associated with a job can be safely deallocated, and there cannot be any dangling pointers.
The machine was designed to run Ada programs. Does this line up with how memory management works in Ada? It certainly resembles how it works in Rust!
Btw, since I'm on mobile without links, type these into Google: "capability computer systems book Levy"; "Army Secure Operating System ASOS". The first is a great book covering many capability architectures like the i432. The second is another HW and OS combo designed as a secure foundation for embedded Ada apps. It was interesting.
Do you perhaps mean "every allocation is its own address space"? I guess that would require pointers that are double the size of regular pointers (the first half pointing to the address space, and the second half being an index into that space).
These kinds of protections are for widespread use, protecting legacy codebases that people don't want to straight-up fix or run under tools like Softbound+CETS with their performance hit. Incidentally, tradeoffs like that usually fail. ;)
> eXclusive OR
Who would've thought.
The paper says that to bypass W^X protection, you can simply scan an executable for "the instruction you want to use, followed by a RET". The paper calls these "gadgets."
You can write any function you want by chaining these gadgets: place their addresses on the stack, and as each gadget's RET executes, control flows to the next one. This lets you build arbitrary computations, since real-world programs are large enough to contain a massive number of gadgets to choose from.
Can someone provide a counterargument?
What this paper describes is called ROP (return oriented programming). Theo has a different patch in OpenBSD to help mitigate that by re-linking the libraries at system startup, randomizing the location of most gadgets.
That's quite clever. By relinking I'm assuming that includes randomisation of the library objects? But that probably doesn't protect you if the program has a memory disclosure bug.
You can try return-oriented programming, but OpenBSD makes that hard: the stack is never executable, everything is position independent, objects get shuffled around inside libraries and executables¹, maybe more.
¹ At least this was proposed, and I think it landed in one of the newer releases, but I can't seem to find it for sure.
Found the slide that I remembered: https://twitter.com/justincormack/status/577005049374601217
Do you know if any of the rest of that is in-progress or complete?
* binutils 2.17 (last GPLv2) - complete.
* all archs transitioned to pie - complete as of 5.7.
* /sbin/init static pie - complete.
* ROP gadgets are being hunted down systematically, many nop sleds changed to int3/illegal instr.
* BROP - many base daemons now fork + exec, not all.
* SROP - mitigation committed - complete?
* shuffle - libc objects get shuffled at boot by rc(8), stack object order shuffling as a gcc extension.
After that, the performance overhead was pretty small on all benchmarks and websites I tested: Kraken and Octane are less than 1% slower with W^X enabled. On (ancient) SunSpider the overhead is bigger, because most tests finish in a few milliseconds, so any compile-time overhead is measurable. Still, it's less than 3% on Windows and Linux. On OS X it's less than 4% because mprotect is slower there.
i'm just kiddin ;)
In the traditional W(rite)|(or)E(xecute) model, a process can both write and execute instructions anywhere in its address space. I might be wrong, but I think this can result in security holes, because you cannot determine at execution time what the code has evolved into.
The W(rite)^(xor)E(xecute) model doesn't allow a process to have memory that is both writable and executable at the same time.
The wikipedia page states:
> W^X requires using the CS code segment limit as a "line in the sand", a point in the address space above which execution is not permitted and data is located, and below which it is allowed and executable pages are placed.
For further information:
Disclaimer: I myself am a beginner in the field. Apologies for oversimplification and possible faulty assessments.
It probably helps that BEAM inherits a lot of Erlang semantics like immutability and process isolation; I reckon this eliminates a lot of the need for a given portion of memory to be both writable and executable at the same time.