I've always been in favor of all OpenBSD security enhancements I've seen, but I have to say, and please hear me out, this is an objectively terrible idea.
Yes, most programs should disallow W|X by default. But trying to banish the entire practice with a mount flag, knowing full well few people will go that far to run a W|X application, is bad practice. I'd rather see this as another specialty chmod flag à la SUID, SGID, etc. Or something along those lines. One shouldn't have to enable filesystem-wide W|X just to run one application.
The thing is, when you actually do need W|X, there is no simple workaround. Many emulators and JITs need to be able to dynamically recompile instructions to native machine code to achieve acceptable performance (emulating a 3GHz processor is just not going to happen with an interpreter.) For a particularly busy dynamic recompiler, having to constantly call mprotect to toggle the page flags between W!X and X!W will impact performance too greatly, since that is a syscall requiring a kernel-level transition.
We also have app stores banning the use of this technique. This is a very troubling trend lately; it is throwing the baby out with the bathwater.
EDIT: tj responded to me on Twitter: "the per-mountpoint idea is just an initial method; it'll be refined as time goes on. i think per-binary w^x is in the pipeline." -- that will not only resolve my concerns, but in fact would be my ideal design to balance security and performance.
Would you really have to call mprotect on OpenBSD?
When SELinux is enforcing this on Linux, you don't. The trick is to map the page twice, once for writing and once for executing. Assuming both mappings are ASLR randomized, it's actually safer than allowing mprotect.
SELinux actually has OpenBSD beat on this one: in some configurations, written pages can not be made executable.
> One shouldn't have to enable filesystem-wide W|X just to run one application
This is a very good point, and it seems like enabling it more granularly would be worth it.
But, for your second point, I have trouble believing that switching between W!X and X!W is going to happen frequently enough to be substantial for performance. (It was very small in Firefox's JavaScript JIT, causing something like a 1%-4% performance hit depending on system).
An emulator is a very different use case than Firefox's JIT.
Take Dolphin (the Gamecube / Wii emulator) for example: you have a 4GiB DVD, it has a massive 80-hour game's entire code engine on there. You cannot recompile the entire disc at startup. Even if you could (you really can't), these games tend to push code into RAM to execute, which a static recompiler cannot handle.
The way dynamic recompilers handle the tremendous burden is to recompile small blocks at a time, and track which ones are hot and cold. When the buffers fill up, they start dropping old, stale code.
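To make the hot/cold bookkeeping concrete, here is a toy sketch of such a block cache. It is not how Dolphin does it; the class name BlockCache, the translate callback, and the least-recently-used eviction policy are all assumptions chosen for illustration.

```python
from collections import OrderedDict

class BlockCache:
    """Toy code cache: guest address -> translated bytes, oldest dropped first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # ordered from coldest to hottest
        self.compiles = 0            # how many times we had to recompile

    def lookup(self, guest_pc, translate):
        code = self.blocks.get(guest_pc)
        if code is not None:
            self.blocks.move_to_end(guest_pc)  # block is hot again
            return code
        code = translate(guest_pc)             # recompile one small block
        self.compiles += 1
        # Buffer full: drop stale blocks until the new one fits.
        while self.blocks and self.used + len(code) > self.capacity:
            _, stale = self.blocks.popitem(last=False)
            self.used -= len(stale)
        self.blocks[guest_pc] = code
        self.used += len(code)
        return code
```

Every miss here means writing fresh machine code into the buffer, so under W^X each miss would also cost at least one permission flip. How often that happens depends on how well the working set fits the buffer.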
It's hard to say exactly what the performance impact would be, and it'd probably vary per game title. But it'd be a lot worse than a web browser recompiling jQuery plus another 200KiB of custom Javascript once on page load.
Plus, I am sure there are many more use cases for W|X than just emulators and JITs. It would be a shame to try and eradicate them all from existence.
Browser JITs don't just recompile things on page load. They have to add inline cache entries whenever the existing IC is missed.
Now in practice ICs usually hit (that's the point!) so once you've been running for a bit things should hopefully not need more recompiling. Which is a significant difference from the situation you describe.
How often is too often? When I output compilation logs for HotSpot, it's almost never not compiling (in terms of human perception of how fast the messages are output).
If it's writing code into memory, and you're going to compile it and rewrite the jump when it's done, isn't the part where the current code is writing in W mode already?
The process would be:
1. compile the code when you fault on the basic block exit.
2. mark that basic block executable.
3. Optionally only after return or other jump: mark the jumping basic block as writable, patch the jump, then mark it as executable.
In this case, there would be three changes. JS JITs are doing this a lot more often than Dolphin is: they often have more than one level of JIT in addition to an interpreter, so they will end up doing this dance more than once per basic block.
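The three steps above can be modelled without touching real memory. This is only a simulation: JitPage, compile_and_link, and the placeholder jump byte are invented names, and the changes counter stands in for the number of mprotect calls a real JIT would make.

```python
class JitPage:
    """Models one page of JIT memory that is writable or executable, never both."""

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.prot = "W"      # fresh pages start out writable
        self.changes = 0     # stand-in for the number of mprotect calls

    def protect(self, prot):
        assert prot in ("W", "X")
        if prot != self.prot:
            self.prot = prot
            self.changes += 1

    def write(self, offset, code):
        if self.prot != "W":
            raise PermissionError("page is not writable")
        self.data[offset:offset + len(code)] = code

def compile_and_link(caller, jump_offset, new_code):
    """Compile a new block, then patch the caller's jump to reach it."""
    before = caller.changes
    target = JitPage()
    target.write(0, new_code)             # step 1: emit the code
    target.protect("X")                   # step 2: make it executable
    caller.protect("W")                   # step 3: reopen the jumping block,
    caller.write(jump_offset, b"\xe9")    #         patch the jump (placeholder byte),
    caller.protect("X")                   #         and seal it again
    return target, (caller.changes - before) + target.changes

caller = JitPage()
caller.write(0, b"\x90\x90")
caller.protect("X")
_, flips = compile_and_link(caller, 1, b"\x90\xc3")
print(flips)
```

In this model each new block costs three protection changes, which is the number that real microbenchmarks would have to multiply by the per-call cost.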
Until I see at least some microbenchmarks and concrete estimates, I don't think I'll worry too much about this. Though it is unfortunate to have to modify all of this code.
An emulator is a particular niche application, and a prime example of an exception for which one could enable W|X - acknowledging that most apps don't need W|X doesn't mean that W|X apps would be eradicated.
> One day far in the future upstream software developers will understand that W^X violations are a tremendously risky practice and that style of programming will be banished outright.
Whoever wrote that they should be banned outright, I feel is being very short-sighted. It would be like banning cars because it's possible to seriously injure a person with one.
tj later replied to me on Twitter saying they meant for it to become per-process in the future. Once that happens, I'll be okay with this change. Right now, I think filesystem-level is far too broad. I often will want an emulator on the same filesystem as I want W^X protections for other applications on.
It's different from your typical language interpreter type JIT in that it's used as an optimisation for generating an optimal processing kernel. Once generated, it's only used once before it's thrown away, as a different input requires a different processing kernel.
Have actually tested using syscalls (not for W^X but rather as a potential workaround for newer Intel CPUs exhibiting weird SMC detection) and have found the overhead to be way too much, even for just one syscall (whilst switching between W/X would require two).
OpenBSD's policy is "secure by default", meaning that users/administrators should have to go out of their way to make their systems insecure; I'm personally quite alright with the handful of non-W^X-compatible programs no longer running if it means having W^X globally-enabled by default.
You're right that per-binary or per-file might be better in terms of providing a fine-grained approach, but on the other hand, I'd very much like to enforce this on a machine-wide basis on my workstations and servers, and having per-filesystem settings seems to be the most effective way to go about this, and I hope that remains an option (similar to OpenBSD's "nosuid" and "noexec" mount options).
Could they not have done something like how paxctl works on Linux? Such as globally enabling it but allowing for application-specific control if you have to disable it (either through paxctl/xattrs or some policy file (RBAC))?
To me that seems to make more sense than a global mount flag but I admit I'm not that knowledgeable about OpenBSD stuff. I suppose it's better late than never considering we've had this stuff in PaX since 2000[1].
For those heading into the comments to know what this is about: W^X is a protection policy on memory with the effect that every page in memory can either be written or executed but not both simultaneously (Write XOR eXecute). It can prevent, for example, some buffer overflow attacks.
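As a small illustration of the rule, here is a sketch of the check a W^X policy applies to a requested set of page permissions. The function name violates_wx is made up; the PROT_* constants are the ones Python's mmap module exposes on Unix. Note that, despite the name, the policy is really "not both": a page that is neither writable nor executable (plain read-only data) is fine.

```python
import mmap

def violates_wx(prot):
    """True if the requested protection is writable and executable at once."""
    return bool(prot & mmap.PROT_WRITE) and bool(prot & mmap.PROT_EXEC)

print(violates_wx(mmap.PROT_READ | mmap.PROT_WRITE))                   # data page
print(violates_wx(mmap.PROT_READ | mmap.PROT_EXEC))                    # code page
print(violates_wx(mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC))  # rejected
```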
I guess I've never understood this. The W^X is a response to one of the classic Multics paper attacks, but doesn't actually work well in a world where JavaScript, JVM byte code, and other non-native code is "sort-of executable". Why not put the effort into a great MMU so that every allocation is its own "page" for the purpose of triggering page faults on overruns?
It seems to have been the i960MX and i960MC which had the funky object-oriented memory; they have long since been discontinued, and I can't find any documentation about them online, sadly.
EDIT: There's a passing mention in a book [1]
> The Intel i960 extended architecture processor used a tagged architecture with a bit on each memory word that marked the word as a "capability", not as an ordinary location for data or instructions. A capability controlled access to a variable-sized memory block or segment. The large number of possible tag values supported memory segments that ranged in size from 64 to 4 billion bytes, with a potential 2^256 different protection domains.
Oh, brilliant, thanks. I suspect this evaded my searches because it doesn't say '960mx' in any machine-readable way!
There's some fascinating stuff in here. For example:
> 8.4 Object Lifetime
> To support the implicit deallocation of certain objects while preventing dangling references, the object lifetime concept is supported. The lifetime of an object can be local or global. "Local" objects have a lifetime that is tied to a particular program execution environment [...]. "Global" objects are not associated with a particular execution environment.
> Each job has a distinct set of local objects. No two jobs can have ADs [access descriptors] that reference the same local object. The processor does not allow an AD for a local object to be stored in a global object. Thus, when a job terminates, all the local objects associated with a job can be safely deallocated, and there cannot be any dangling pointers.
The machine was designed to run Ada programs. Does this line up with how memory management works in Ada? It certainly resembles how it works in Rust!
Ada was designed for environments where dynamic allocation wasn't allowed. So, it doesn't have a safe scheme for that. It offers regions, GC, or fast/unsafe deallocation. This is one area where Rust is a dramatic improvement. I agree i960 is reminiscent of schemes like Rust. Very forward thinking.
Btw, since I'm on mobile w/out links, type these into Google: capability computer systems book Levy; Army Secure Operating System ASOS. First is a great book with many capability architectures like i432. Second is another HW and OS combo designed as a secure foundation for embedded Ada apps. It was interesting.
> Why not put the effort into a great MMU so that every allocation is its own "page" for the purpose of triggering page faults on overruns?
Do you perhaps mean "every allocation is its own address space"? I guess that would require pointers that are double the size of regular pointers (the first half pointing to the address space, and the second half being an index into that space).
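Here is a rough model of that double-width pointer idea, purely as a thought experiment. The 64-bit halves, the names pack, unpack, and SegmentedMemory, and the choice of IndexError for an overrun are all assumptions; real hardware designs differ in the details.

```python
WORD_BITS = 64
OFFSET_MASK = (1 << WORD_BITS) - 1

def pack(space, offset):
    """Build a double-width pointer: high half = address space, low half = index."""
    return (space << WORD_BITS) | (offset & OFFSET_MASK)

def unpack(ptr):
    return ptr >> WORD_BITS, ptr & OFFSET_MASK

class SegmentedMemory:
    """Every allocation lives in its own address space with its own bounds."""

    def __init__(self):
        self.spaces = []

    def alloc(self, size):
        self.spaces.append(bytearray(size))
        return pack(len(self.spaces) - 1, 0)

    def _resolve(self, ptr):
        space, offset = unpack(ptr)
        buf = self.spaces[space]
        if offset >= len(buf):
            # The overrun faults instead of landing in a neighbouring allocation.
            raise IndexError("access past the end of the allocation")
        return buf, offset

    def load(self, ptr):
        buf, offset = self._resolve(ptr)
        return buf[offset]

    def store(self, ptr, value):
        buf, offset = self._resolve(ptr)
        buf[offset] = value
```

Ordinary pointer arithmetic such as ptr + 3 only moves the low half, so walking off the end of one allocation can never turn into a valid pointer into another one.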
There's already work on that sort of thing with crash-safe.org and Cambridge's CHERI. SVA-OS by Criswell et al to a degree.
These kinds of protections are for widespread use, protecting legacy codebases that people don't want to straight up fix, or where they won't take the performance hit of tools like Softbound+CETS. Incidentally, tradeoffs like that usually fail. ;)
FWIW Firefox manages to use W^X in their JIT (the experimental patch back in 2011 had a very interesting idea: one process writing out to memory mapped RW, another process with the same memory mapped executable): http://jandemooij.nl/blog/2015/12/29/wx-jit-code-enabled-in-...
Further, XOR is an exclusive OR. In boolean logic, this means the result is 1 only when the two values differ. So 0 ^ 0 would be 0, and 1 ^ 1 would also be 0. However, 1 ^ 0 would be 1.
The paper says that to bypass W^X protection, you can simply scan an executable for "the instruction you want to use, followed by a RET". The paper calls these "gadgets."
You can write any function you want by using these gadgets: simply call them. When you call a gadget, it executes the corresponding instruction, then returns. This allows you to write arbitrary functions, since real-world programs are large enough that they have a massive number of gadgets for you to choose from.
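To show what "scanning for gadgets" means mechanically, here is a deliberately naive sketch. It assumes x86, where a near RET is the single byte 0xC3, and the function name find_gadgets is made up. It only lists candidate byte windows ending in RET; real tools also disassemble each window to check that it decodes to useful instructions.

```python
RET = 0xC3  # x86 near-return opcode

def find_gadgets(code, max_len=3):
    """Return (offset, bytes) for every window of up to max_len bytes plus a RET."""
    gadgets = []
    for i, byte in enumerate(code):
        if byte == RET:
            start = max(0, i - max_len)
            gadgets.append((start, bytes(code[start:i + 1])))
    return gadgets

# pop rax; ret; nop; pop rdi; ret
sample = bytes([0x58, 0xC3, 0x90, 0x5F, 0xC3])
for offset, window in find_gadgets(sample, max_len=1):
    print(offset, window.hex())
```

Nothing here writes to memory, which is why W^X alone does not stop this style of attack: the attacker only reuses bytes that are already executable.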
W^X isn't a solution for all memory corruption bugs. Only some.
What this paper describes is called ROP (return oriented programming). Theo has a different patch in OpenBSD to help mitigate that by re-linking the libraries at system startup, randomizing the location of most gadgets.
> What this paper describes is called ROP (return oriented programming). Theo has a different patch in OpenBSD to help mitigate that by re-linking the libraries at system startup, randomizing the location of most gadgets.
That's quite clever. By relinking I'm assuming that includes randomisation of the library objects? But that probably doesn't protect you if the program has a memory disclosure bug.
This is ROP, and W^X never was intended to stop this. Here's a link to a mailing list post that predates that paper by 3 years, which says exactly that. [1]
Yeah. OpenBSD compiles everything as PIC, has ASLR, and they were considering randomizing the locations of objects within libraries and executables (I think this was or is being implemented), as well as limiting what you can do with a ROP exploit in the first place.
You need to have a memory disclosure bug (they're very common). You don't even need to be able to read all of memory, you just need to know the start offset of where the library is loaded (if the library is from a distribution you can download it and figure out the ROP offsets using the library -- though apparently OpenBSD has defences against that too).
You've just described ROP chains. While ROP chains allow you to bypass W^X they are not without their limitations. First of all, you need to have some sort of memory disclosure of the target process in order to figure out what gadgets exist. Secondly you need to have significant control of the stack (lots of empty space) to create a full ROP chain. Thirdly, not all processes have enough libraries loaded for you to do a nice ROP chain (I believe glibc gadgets are Turing complete but I think some other libcs are not).
Does this mean to successfully exploit a program I need to write to an area in memory that the program will later turn the page in memory to "Execute"?
Yep! That's the idea here: it makes it less likely to accidentally treat data as instructions. If you can figure out some bug in a JIT compiler to get it to write your payload to a page that will get turned executable, then you still have an exploit, but the attack surface is smaller.
You can try return-oriented programming, but OpenBSD makes that hard: the stack is never executable, everything is position independent, objects get shuffled around inside libraries and executables¹, maybe more.
¹ At least this was proposed, and I think it landed in one of the newer releases, but I can't seem to find it for sure.
After that, the performance overhead was pretty small on all benchmarks and websites I tested: Kraken and Octane are less than 1% slower with W^X enabled. On (ancient) SunSpider the overhead is bigger, because most tests finish in a few milliseconds, so any compile-time overhead is measurable. Still, it's less than 3% on Windows and Linux. On OS X it's less than 4% because mprotect is slower there.
Presumably that's why GCC, JDK, Mono, Chrome, etc are all in violation? Modifying your JIT to flip-flop between 'W' and 'X' seems like the naive fix but I imagine it ruins whatever hotspot tracing that may also be local to the JITted code.
NetBSD is going through some similar security moves currently (extending PaX[0]), and iiuc, there are special considerations required for Java/JVM, because of the bytecoding process. Does anybody know if my understanding is correct (that a page will have to be both writable and executable) and if so, what are OpenBSD's considerations for this?
I dunno why, but this quote from Benjamin Franklin came to my mind - “Those who surrender freedom for security will not have, nor do they deserve, either one.”
I am not an expert on the field. But | means or and ^ means xor.
In the traditional W(rite)|(or)E(xecute) model, the processes can both write and execute instructions in their address space in memory. I might be wrong but I think this could result in security leaks, because you cannot determine what the code evolves into at execution time.
W(rite)^(xor)E(xecute) model doesn't allow the processes to write AND execute instructions at the same time.
The wikipedia page states:
> W^X requires using the CS code segment limit as a "line in the sand", a point in the address space above which execution is not permitted and data is located, and below which it is allowed and executable pages are placed.
This means using W^X is an improvement for security, but it's hard to adapt to from a technical viewpoint and limits your computer's ability in meta-programming.
I haven't tried it on -current yet, but I don't recall ever hearing about BEAM (Erlang's VM) being affected by this.
It probably helps that BEAM inherits a lot of Erlang semantics like immutability and process isolation; I reckon this eliminates a lot of the need for a given portion of memory to be both writable and executable at the same time.