
WasmBoxC: Simple, Fast, and VM-Less Sandboxing - syrusakbary
https://kripken.github.io/blog/wasm/2020/07/27/wasmboxc.html
======
mdriley
> Wasm sandboxing is even safe to run in the same process as other code (at
> least modulo Spectre-type vulnerabilities)...

If you want strong security with this big a modulus you may have been looking
for RSA. (ugh, sorry)

Spectre V1 (speculative bounds check bypass / type confusion) is basically
game over for intra-process memory isolation without introducing expensive or
complicated mitigations: speculation barriers at _every_ branch, or a more
optimized pass like Speculative Load Hardening
([https://llvm.org/docs/SpeculativeLoadHardening.html](https://llvm.org/docs/SpeculativeLoadHardening.html))
that can bring the overhead _down_ to 20-50%.

From what I've seen, the best understanding right now is that the process is
the smallest defensible unit of isolation.

That's not to say that using WASM as an intermediate compilation target won't
eliminate some threats! The runtime can definitely help to improve memory
safety, for example. But if you care about _confidentiality_ of the data in
your process, any code that is untrusted enough to require sandboxing should
also be run out-of-process, e.g. across an IPC boundary.

~~~
KMag
Spectre makes me wish we'd give something like IBM's project DAISY /
Transmeta's Crusoe another chance. If we have one mode with a simple and high-
density instruction set similar to Hitachi's SH4 / ARM's Thumb2, along with a
VLIW mode (similar to switching between Thumb2 and regular ARM), and some
hardware support for tracing and dynamic recompilation (including reservoir
sampling of instructions causing pipeline stalls), we might get decent
performance and code density, while being able to move all of the speculation
into the dynamic recompilation layer.

Having the processor natively support the dense non-VLIW instruction set means
you don't pay much in the way of startup latency, and once you're warmed up,
your hot spots have all of their cross-DLL calls inlined / no longer indirect,
and your virtual function calls devirtualized / speculative inlined. Spots
where the dynamic recompiler was wrong about pipeline stalls eventually use
dynamic information to recompile and shuffle instructions around to avoid
pipeline stalls. You don't pay the transistor and power budget for out-of-
order execution or speculative execution hardware. Hardware speculative
execution gets replaced with predicated instructions and shadow register
save/restores.

I think IBM's DAISY was much closer to being on the right track vs. Intel's
EPIC / Itanium. If your system is built for dynamic recompilation and re-
optimization of native code on the fly, your compiler doesn't have to be as
good at statically predicting execution paths and pipeline stalls.

Transmeta's main problem was that they were emulating all x86 instructions
before they warmed up, and after warm-up they were still emulating all x86
instructions outside their hot code path. With a simpler instruction set like
SH4 / Thumb2, they could hopefully have hardware support for the code outside
of the hot spots / pre-warmup.

Given the amount of time spent running JavaScript, hopefully one would also
look at the bytecodes for V8, JavaScriptCore, and SpiderMonkey for inspiration
as far as making the non-VLIW instruction set an efficient and compact
JavaScript JIT target.

~~~
titzer
> If we have one mode with a simple and high-density instruction set similar
> to Hitachi's SH4 / ARM's Thumb2, along with a VLIW mode (similar to
> switching between Thumb2 and regular ARM), and some hardware support for
> tracing and dynamic recompilation (including reservoir sampling of
> instructions causing pipeline stalls), we might get decent performance and
> code density, while being able to move all of the speculation into the
> dynamic recompilation layer.

Trading hardware complexity for software complexity is not the right strategy
to combat sidechannels. They just reappear up the stack.

Any speculative optimization that depends on program values can potentially
leak information through timing.

~~~
nine_k
With speculation moved to a (JIT) compiler, this becomes way harder to
trigger.

You can no longer deterministically trigger a speculative fetch and detect
whether it was slow or served from cache. Whether the speculative fetch ever
occurs is determined by the JIT, and once the JIT realizes it should not
occur, it will never occur again. Quite likely you won't be able to collect
enough bits _even_ if speculative execution is supported by the hardware. If
it's not, you will just not have anything to time, AFAICT.

~~~
titzer
It's true that it is much, much harder to trigger, and less predictable, but
the information leak is still there. Both the bitrate and the signal-to-noise
ratio are far worse.

It's worth pointing out that JITs can and do get caught in deopt loops, and it
doesn't even necessarily need to be a deopt loop in a JIT that is the
information leak. Something as simple as interned strings can leak information
about what strings a program is using. The same goes for hashtables: if you
have control over some of the keys that go into a hashtable and some
knowledge of its implementation, timing can reveal hash collisions, and thus
other keys in the table.

Fundamentally, side channels are unwanted information flows in a system, and
the more complicated the system, the more potential there is for side
channels. Moving complexity around might make a difference on the reliability
and bandwidth, but it doesn't eliminate them. If side channels are a serious
concern, the best defense is simplicity, not massive rearchitecting to a new
and different kind of complexity.

(even speaking as a person who spent decades working on JITs--wrong hammer
here)

~~~
KMag
My understanding is that Spectre/Meltdown attacks are all the result of either
conditional memory operations conditioned upon speculatively read data, or
else speculative data operations using addresses calculated from data that was
speculatively loaded.

Is that correct? Doesn't moving speculation from hardware control to software
control allow you to perform better dataflow analysis than you can reasonably
do in hardware, which allows you to perform more provably safe speculations
(or less often pay the I/O overhead of converting a speculative conditional
load into an unconditional load and a speculative conditional register-to-
register move) than you get if the speculation decisions need to be performed
in hardware?

------
legulere
> By compiling to wasm we sandbox the code, preventing it from accessing
> anything on the outside.

Operating systems are in a sad state, as virtual address spaces already offer
exactly that in hardware at full speed. It's only through the operating system
APIs that processes gain the ability to affect anything outside of the
process.

Current operating systems weren't made with untrusted code in mind, but have
so much inertia that new operating systems repairing old misfeatures can never
succeed. All code is already written based on the bad Operating System APIs
like for writing files. This project just handwaves the actual problem away:

> the sandboxed code can’t do anything but pure computation, unless you give
> it a function to call to do things like read from a file, tell the time,
> etc.

~~~
james412
> Operating systems are in a sad state, as virtual address spaces already
> offer exactly that in hardware at full speed

Ever since reading about Microsoft Singularity all that time ago, I reached
the opposite conclusion: software is in such a sad state that it must rely on
hardware to provide isolation. From this perspective, WasmBoxC and many
projects like it are IMO a huge step in a desirable direction.

The main lesson from Singularity for me (aside from requirements for memory
safety) was that cross-component safety can be achieved by formalizing the
protocols those components use to communicate. Sing# had a dedicated type to
capture the state machine for every cross-component transaction, with strongly
typed inputs and outputs. This problem is not unique to software isolation --
it is the basis for a huge variety of security problems everywhere across the
ecosystem, not least network services.

Wouldn't it be a wonderful world if we knew our application was fully safe
when exposed to a network for the same reason we know it is fully safe to run
in the same address space as another untrusted application? That is the path
Singularity took us along.

~~~
pjmlp
Example of such sad state of affairs, Android 11 is adding support for
hardware memory tagging, as static analysis alone is not enough to tame the C
and C++ components.

[https://source.android.com/devices/tech/debug/tagged-pointers](https://source.android.com/devices/tech/debug/tagged-pointers)

iOS, Solaris on SPARC are on this path as well.

Regarding Singularity, we are slowly moving away from C on non pure UNIX
clones, but still it will take generations.

------
pjmlp
Well,

"Everything Old is New Again: Binary Security of WebAssembly" \- USENIX 2020

[https://www.usenix.org/conference/usenixsecurity20/presentat...](https://www.usenix.org/conference/usenixsecurity20/presentation/lehmann)

~~~
azakai
That's a good article! Some notes on it:

[https://twitter.com/kripken/status/1284576787624648705](https://twitter.com/kripken/status/1284576787624648705)

Perhaps wasm isn't (yet) a good replacement for native binaries for the
reasons they mention. In particular such an application usually has access to
files and timing etc., and you're (currently) missing some safety techniques
native binaries use, which is a risky combination.

But as mentioned in the post here, if you're sandboxing a specific library
that does pure computation (say, a codec or a compression library) then using
wasm you can make sure it cannot escape the sandbox and that it has no timing
or other OS capabilities. Those are powerful guarantees!

~~~
pjmlp
Pure computation is still subject to UB and memory corruption due to the lack
of bounds checking inside linear memory blocks, leading to outputs that
cannot be trusted, even though they are sandboxed.

~~~
azakai
Definitely, yes, and you do need to be careful about those outputs. Still, the
sandboxing guarantee here is very useful!

This isn't theoretical, Firefox does this approach in production (using RLBox,
which is mentioned in the post),

[https://hacks.mozilla.org/2020/02/securing-firefox-with-webassembly/](https://hacks.mozilla.org/2020/02/securing-firefox-with-webassembly/)

~~~
pjmlp
Thanks for the link, I missed that post.

------
jasonzemos
> The OS-based implementation uses the “signal handler trick” that wasm VMs
> use. This technique reserves lots of memory around the valid range and
> relies on CPU hardware to give us a signal if an access is out of bounds
> (for more background see section 3.1.4 in Tan, 2017).

On Linux you can also make use of userfaultfd(2) rather than handling SIGSEGV.

------
lostmsu
This is a really cool concept. I'd love a mature, supported technology that
could replace .NET's AppDomain.

~~~
merb
biggest problem is that .net/dotnet core can not easily be AOT-compiled to
wasm. currently you compile the runtime to wasm and run the dlls on top of
that wasm runtime. this is the approach that blazor wasm uses and it is awful.

~~~
ckok
I tried this. AOT compiling with the .NET runtime just isn't very practical. I
know Mono does this, with some guided info for reflection-based classes. But
the way the class library is structured, string pulls in globalization. And
enumeration. And comparison and equality, which both use reflection to pick
the right default implementation. Reflection is expected to work, which pulls
in all methods and their dependencies. By the time you have a working
executable for hello world, you're a few MB in. The class library just isn't
set up for this kind of use.

~~~
GordonS
I was thinking about this just the other day, when Microsoft announced the
latest dotnet 5 preview, and yet again have postponed AOT - Microsoft have
been teasing dotnet devs with the promise of production-ready AOT for
something like a _decade_.

Frankly, I wish they'd put up or shut up - either prioritise it and make it
happen, or just admit defeat and say it's not going to happen.

~~~
pjmlp
Production-ready AOT for .NET has existed since Singularity, whose MIDL and
Bartok compilers were the basis of .NET for WinRT on Windows 8/8.x.

Then some of the Midori tech eventually made its way into .NET Native, which
from my point of view UWP + .NET Native is what .NET should have been all
about back in 2001.

Apparently after the timid attempt with XAML Islands and MSIX, they seem to be
getting the house in order and driving the platform into a way to pretend that
Windows 8 and 8.1 never happened, but it seems to be lacking a lot of
coordination and long term planning.

Windows 10X apparently is also not getting Win32 sandbox any longer, this
assuming it ever gets released.

Still, in the middle of the chaos, it feels much better than if I had to deal
with Android on a daily basis, where one year's IO best practices are the
next year's legacy.

~~~
kevingadd
I think when someone says "production ready AOT" in any context, it's never
quite obvious what they mean. As ckok implied, even when AOT is "working" it
may generate a 100mb executable. For some end users that is production ready
(like Facebook, who reportedly were shipping 1gb+ AOT'd php executables to
their server cluster) and for other end users it is not (because a 100mb
browser app is a closed tab.)

WASM now and emscripten before it both were designed and optimized for POSIX C
apps and for games, and they're pretty good for those scenarios. Larger-scale
stuff is pretty tricky and the tooling ecosystem will have to continue to grow
to support more real-world applications. JIT, stackwalking, GC, etc are all
still not there on WASM - thankfully threading is finally crossing the finish
line but even that has taken years.

~~~
pjmlp
Some day WebAssembly will match what Flash CrossBridge and PNaCl already had
10 years ago.

------
syrusakbary
Alon did a really great job with the article, hats off.

I think WasmBoxC might be useful to try to benchmark how fast we can get Wasm
to run server-side. We also hope to eventually beat native execution in
server-side specific Wasm runtimes (such as Wasmer using the LLVM compiler and
Profile Guided Optimizations) once the runtime ecosystem matures a bit.

Keep up the good work!

------
brianolson
Gary Bernhardt was prophetic: [https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript](https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript)

~~~
esperent
Wasm is amazing, but it's not intended to replace JavaScript. If you just want
to add a form to a website, animate a drop-down menu, or the like, JS (or TS)
will probably always be preferable to writing in another language and
compiling to wasm. The web platform moves fast and maybe in a couple of years
I'll eat these words. But nothing I've seen so far points towards the demise
of JS.

~~~
ronjouch
Gary's talk is not exactly about the "demise of JS". Watch the talk, it's a
great one :)

------
devwastaken
Any time these come up i don't find a comprehensive analysis on security. The
best, supposedly most secure libraries in the world get significant
vulnerabilities.

~~~
tpetry
This does not make any software magically secure. Its only purpose is to
_sandbox_ a library to prevent it from doing anything malicious. As wasm
can't do anything more than compute, you simply contain the library: it can
no longer open any files, make network calls, etc.

~~~
wahern
> As wasm cant do any more than computing you simply contain the library, it
> cant anymore open any files, make network calls, etc.

That presumes the WASM implementation is bug free, which is not a great
presumption, especially for the more sophisticated implementations with JIT
engines, and even more so for those adding multi-threading, GC, etc.

------
pansa2
So, compiling via WasmBoxC gives a 14% - 42% performance overhead compared to
native compilation.

Is there also a significant overhead in terms of binary size?

~~~
kevingadd
wasm binaries are typically size-competitive with native x86 _once you
compress them_ (i.e. foo.so.zip vs foo.wasm.zip), but the jitcode you get out
of them is usually much bigger. In my testing the size overhead for stuff like
the ICU unicode library was maybe 20-40% on-disk, depending on compiler
settings. It's gonna depend on your workload.

Note that even if the actual generated jitcode is computationally efficient,
it might be larger and as a result waste more space in the instruction cache,
which would hinder performance.

------
sriku
Sounds similar to Fastly's "Lucet" \- [https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime](https://www.fastly.com/blog/announcing-lucet-fastly-native-webassembly-compiler-runtime)

------
Chris2048
so, if you compiled a language shell or interpreter this way, you would end
up with a sandboxed shell/interpreter?

~~~
ncmncm
One that couldn't do anything. The essence of the method is providing to the
sandbox specific, safe means to interact with the surrounding process.

