Rustler is a library for writing Erlang NIFs in safe Rust code (github.com)
168 points by bryanrasmussen 9 months ago | 51 comments

I've been trying to decide between learning Elixir and learning Rust and Elixir had a narrow lead because of employment potential and my growing dissatisfaction with Docker (container orchestration contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Erlang).

Then I learned that the Achilles' heel of the BEAM VM is that if native code crashes it takes the whole thing with it, so people have been using Rust for native code (NIFs). So now the plan is Elixir then Rust which is easing my FOMO quite a bit.

> Then I learned that the Achilles' heel of the BEAM VM is that if native code crashes it takes the whole thing with it,

That's not really an 'Achilles heel' any more than most other languages that allow you to load C code that could segfault or - maybe even worse - get stuck in an infinite loop.

Indeed, the fact that BEAM focuses on and acknowledges this is an advantage for people trying to create robust systems. They've had workarounds in place for years, like running stuff in a separate system process.

This is actually the point of Rustler: Rust can make guarantees about crashes, so you can write safe low-level code that never crashes the VM.

They ought to point out the notion of dirty NIFs a bit more prominently: https://bgmarx.com/2018/08/15/using-dirty-schedulers-with-ru...

Not crashing is good, but not sufficient to play nicely with the BEAM scheduler. You need to also ensure you're not using up a lot of time in some long-running operation, unless it's a "dirty NIF".

Yes. I write NIFs all the time, and it isn't crashing that's the problem for me (I'm a good programmer!) -- it's taking too much time. We have a NIF that does CUDA computations and we have to make sure we break them up right. Rust won't help me on this. I need C++ for CUDA.
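To make "break them up right" concrete, here is a minimal sketch in plain Rust of doing a long computation in bounded slices, so each native call returns quickly and the caller re-enters with the saved state. Everything in it (`Step`, `sum_squares_step`, `run_to_completion`, the item budget) is a hypothetical illustration, not the Rustler or erl_nif API, and the driver loop merely stands in for the Erlang side.

```rust
/// Progress of a long computation that is performed in bounded slices.
/// All names are hypothetical; nothing here is part of Rustler or erl_nif.
#[derive(Debug, PartialEq)]
enum Step {
    /// More work remains: call again with this offset and partial result.
    Continue { next: usize, partial: u64 },
    /// The whole input has been processed.
    Done(u64),
}

/// Sums the squares of `data`, handling at most `budget` items per call.
fn sum_squares_step(data: &[u32], start: usize, partial: u64, budget: usize) -> Step {
    // Clamp the inputs so a bad offset or a zero budget cannot panic or stall.
    let start = start.min(data.len());
    let end = data.len().min(start.saturating_add(budget.max(1)));
    let mut acc = partial;
    for &x in &data[start..end] {
        acc += u64::from(x) * u64::from(x);
    }
    if end == data.len() {
        Step::Done(acc)
    } else {
        Step::Continue { next: end, partial: acc }
    }
}

/// Stand-in for the calling side: re-enters the native function until it is
/// done and reports the total together with the number of calls made.
fn run_to_completion(data: &[u32], budget: usize) -> (u64, usize) {
    let (mut start, mut partial, mut calls) = (0, 0, 0);
    loop {
        calls += 1;
        match sum_squares_step(data, start, partial, budget) {
            Step::Done(total) => return (total, calls),
            Step::Continue { next, partial: p } => {
                start = next;
                partial = p;
            }
        }
    }
}
```

With five items and a budget of two, the driver makes three short calls instead of one long one. A real NIF would pick the budget from measured time rather than an item count, or run as a dirty NIF instead.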

1) I'm helping out with a project to bring some nice deep learning semantics to the BEAM. Can you share the CUDA nifs?

2) I'm considering taking up Zig as a better (well simpler, because my poor brain can't handle rust) safe language for NIFs. Do you think that zig might make a better fit as it can easily integrate C headers into itself?

3) are dirty nifs not ok?

We do plan to make the NIFs open source! Keep searching on github.

We use ours for signal processing, not "deep learning" so it may not be completely applicable.

I see, and I think I know how you could "fix" this. Use CUDA streams, and use what's available to check whether queueing an operation in a stream would/could block. (Last I read, they seemed to have around 2^11 entries in a presumably fixed ring buffer.)

Use this to at least notify the erlang side that it can't do this operation right now, either like send(2) via EAGAIN / EWOULDBLOCK, or, if possible, by implementing a "push-notification" from the CUDA scheduler to the BEAM IO subsystem to try again automatically (sort-of implementing CUDA stream enqueue operations as a custom type of IO) or base a poll(2)-style function (to be called by erlang code that then tries to re-submit or maybe propagate a resource exhaustion condition/timeout upstream) on this enqueue-possible notification.

The most important thing to support would be a way to allow the caller to not overfill the queue and allow an erlang thread to efficiently block until woken by a completion event in a CUDA stream.

I'm not sure, but I think the former could be done by manually accounting for queue fill-level in the erlang->CUDA library. The latter should be quite easy with CUDA stream callbacks inserted into a helper stream which got a "block for CUDA stream event $x" inserted right before the stream callback. The benefit would be that it's (mostly?) transparent to potential other Rust/C++ code interacting with the same CUDA streams for e.g. backend-fetching or handling/scheduling expensive CPU processing with tight integration into the CUDA operations.
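The "manually accounting for queue fill-level" idea could look roughly like the sketch below: the submitting side keeps its own count of in-flight operations and refuses new ones with an EAGAIN-style error when full. This is plain Rust with hypothetical names (`SubmissionQueue`, `SubmitError`, `try_submit`, `complete_oldest`); it does not call CUDA, and the capacity is whatever the caller assumes the real queue can hold.

```rust
use std::collections::VecDeque;

/// Error returned when the queue cannot accept more work right now.
#[derive(Debug, PartialEq)]
enum SubmitError {
    /// The queue is at capacity; the caller should retry later (like EAGAIN).
    WouldBlock,
}

/// Caller-side bookkeeping of operations that were queued but not yet completed.
/// Hypothetical sketch: it tracks fill level only and never touches a GPU.
struct SubmissionQueue<T> {
    capacity: usize,
    in_flight: VecDeque<T>,
}

impl<T> SubmissionQueue<T> {
    fn new(capacity: usize) -> Self {
        SubmissionQueue { capacity, in_flight: VecDeque::new() }
    }

    /// Accepts `op` if there is room and returns the new fill level,
    /// otherwise rejects it immediately instead of blocking.
    fn try_submit(&mut self, op: T) -> Result<usize, SubmitError> {
        if self.in_flight.len() >= self.capacity {
            return Err(SubmitError::WouldBlock);
        }
        self.in_flight.push_back(op);
        Ok(self.in_flight.len())
    }

    /// To be called when a completion for the oldest operation is observed;
    /// frees one slot and hands the finished operation back.
    fn complete_oldest(&mut self) -> Option<T> {
        self.in_flight.pop_front()
    }
}
```

The Erlang side would treat `WouldBlock` as "try again later" or propagate it upstream as a resource-exhaustion condition, as described above.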

Last I checked there might not have been an efficient way to syscall-block on CUDA stream events, so that one would have to manually trade between context switch overhead and latency because you'd have (had) to sched_yield(3) in between waking up to check if the stream is done or not. The only blocking wait call was a busy poll that would result in wasting the CPU core and preventing other threads/applications from using it while the GPU was busy going through the queued-up tasks/operations. I didn't try, or at least don't remember trying, whether stream callbacks could be (ab-)used for delivering progress information without actively/explicitly waiting on any CUDA stream(-event). It might be possible though.
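For contrast with busy polling, here is a sketch of the kind of blocking wait being asked for, using only the Rust standard library: a waiter sleeps on a condition variable until some other thread signals completion. The `Completion` type and `wait_for_simulated_work` are hypothetical, and a sleeping thread stands in for the GPU; whether a CUDA stream callback can actually drive `signal` is exactly the open question above.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// One-shot completion flag; a hypothetical stand-in for "the stream finished".
#[derive(Clone)]
struct Completion {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl Completion {
    fn new() -> Self {
        Completion { inner: Arc::new((Mutex::new(false), Condvar::new())) }
    }

    /// Called from the notifying side (for example a completion callback).
    fn signal(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() = true;
        cvar.notify_all();
    }

    /// Blocks the calling thread without spinning until `signal` has run.
    fn wait(&self) {
        let (lock, cvar) = &*self.inner;
        let mut done = lock.lock().unwrap();
        // Loop to guard against spurious wakeups.
        while !*done {
            done = cvar.wait(done).unwrap();
        }
    }

    fn is_done(&self) -> bool {
        *self.inner.0.lock().unwrap()
    }
}

/// Simulates work that finishes on another thread after `delay`, then waits for it.
fn wait_for_simulated_work(delay: Duration) -> bool {
    let completion = Completion::new();
    let signaller = completion.clone();
    let worker = thread::spawn(move || {
        thread::sleep(delay); // pretend the device is busy
        signaller.signal();
    });
    completion.wait(); // the waiter uses no CPU while blocked here
    worker.join().unwrap();
    completion.is_done()
}
```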

It's definitely the "achilles heel" in the sense that the only real compelling feature of a weirdo language and environment like Erlang and BEAM is its use / capability in creating both scalable and highly fault tolerant systems.

If in fact the fault-tolerant premise is broken as soon as you need something outside the capabilities of the managed language, why learn a weirdo language? You might as well just use C# or Java instead.

The poster is writing about 'native' code, i.e. embedded code written in C/C++, Rust, etc. If you don't use 'FFI' or NIF code though, then error handling on the BEAM is still a remarkable thing.

It's definitely a more forgiving and powerful model than 'try/catch'.

> The poster is writing about 'native' code, i.e. embedded code written in C/C++, Rust, etc.

Yes, I know — so was I:

> … as soon as you need something outside the capabilities of the managed language …

You can also use Erlang port to communicate with external processes: https://elixirschool.com/blog/til-ports/

NIFs should be used when port performance is not acceptable.
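For anyone wondering what the external program's side of a port looks like: when a port is opened with the `{packet, 4}` option, each message is prefixed with a 4-byte big-endian length. Below is a minimal sketch of that framing in Rust; the helper names `read_frame` and `write_frame` are made up for illustration, and a real port program would call them on stdin and stdout.

```rust
use std::io::{self, Read, Write};

/// Reads one frame: a 4-byte big-endian length followed by that many bytes.
/// Returns `Ok(None)` when the stream ends while reading the header, which
/// this sketch treats as the port being closed.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match reader.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(Some(payload))
}

/// Writes one frame with the same 4-byte big-endian length prefix.
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)?;
    writer.flush()
}
```

If the external program crashes, the port closes and the owning Erlang process is notified; the VM itself keeps running, which is the trade-off against NIF speed.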

About deciding between Elixir and Rust you will benefit from learning both as Elixir/Erlang are more suited for fault tolerant distributed systems and Rust aims at becoming a safer C++. Also, Rust + wasm opens up a great toolset to build faster frontend apps.

If you follow the Elixir path first you'll benefit from learning some Erlang as well.

> Also, Rust + wasm opens up a great toolset to build faster frontend apps.

Hypothetically. It's yet to be seen (or full applications at least).

Very true.

The current frontend frameworks that exist are very immature and I wouldn't use them for anything other than small personal projects right now.

The current translation layer (going through JS to access the DOM) has an overhead, but it is surprisingly small. There are no frameworks yet that really take advantage of this, though.

Things will become really interesting once wasm code can access the DOM directly without going through JS. (See the "Interface Types" wasm proposal)

But then again, the value of writing a frontend in Rust with its verbosity and strict type system is very questionable when we have Typescript and the incredibly fast JS engines.

It could be valuable for large, long running apps. But those are commercial projects, which won't go for Rust (language complexity, lack of developers, productivity, ...)

I agree with most of your points, but will note a couple of things:

1) Tools like wasm-pack already generate typescript definitions for your WASM targeting Rust code - it seems like it shouldn't be too painful to write just a couple of particularly hot loops in Rust even if the bulk of your web app is in Typescript.

2) Multi-threaded JS is a non-starter, but we're already seeing experimental support for multi-threaded WASM. Admittedly, web workers let you write something similar to multi-process JS...

3) Less GC pressure, since WASM memory isn't garbage collected. While the scale of things you'd put in a website or typical web app really don't care about this, GC pauses in the single digits of milliseconds can start causing problems for games and other framerate sensitive applications if they cause vsync to be regularly missed.

4) Sharing code between a web demo and a "proper desktop app" via WASM-targeting languages like Rust imposes far fewer constraints than a JS-targeting language like Typescript (where you now need a full-blown JS engine, if not a full-blown browser engine à la Electron, for the desktop version)

It wouldn't surprise me if WASM never gains widespread adoption for frontend development. But it also wouldn't surprise me if it gets used by a node module or two that eventually get popular, in an attempt to shave off a couple milliseconds, with end users being none the wiser.

That and you need to allow unsafe scripts in your CSP to support WASM in Chrome, so it's a bit of a killer for anything that takes security seriously.

Good chance that'll start to change as WASM gets more fleshed out and more commonly used, but that is a barrier to its adoption at the moment.

What's stopping it exactly?

Never mind, found time to research it myself; this was a good overview: https://github.com/WebAssembly/content-security-policy/blob/...

Chrome restricts it to unsafe-eval CSP. The main risk described seems to be escaping to the global object. Can't we use the power of a tracing GC to determine whether the global object is reachable from the import object, and put _that_ under unsafe-eval? Or add descriptions/tags that need to explicitly allow dangerous APIs to be reachable by the import object, and refuse if any undeclared dangerous APIs are reachable with a CSP not containing unsafe-eval?

I see both sides, but full unsafe-eval is too much of a wrecking ball when a warded lock[0] on this proverbial door to the WASM kingdom would be sufficient.

[0]: https://en.wikipedia.org/wiki/Warded_lock

Both Firefox and Safari support same-origin CSP on instantiateStreaming sources though, which doesn't work on Chrome.

I actually mean even harsher restrictions than just some same-origin. Specifically, I mean using already-existing mechanics to figure out whether the global object is reachable by the WASM code. If so, treat as unsafe-eval for all I care. If not, treat as risky as the functionality exposed to the WASM code (testing reachability for other objects/functions that are not the global object itself).

> Then I learned that the Achilles' heel of the BEAM VM is that if native code crashes it takes the whole thing with it

People talk about "let it crash" in the BEAM world, but the advice I got that made understanding things a whole lot easier is to focus on recovering more than on preventing crashes. I've found Elixir to be better at both not crashing and better at recovering. The abstractions for both just make more sense to me. That being said, I wouldn't have been able to understand those concepts before learning Elixir. So I say learn Elixir and OTP well, and decide for yourself.

There's not that much need for NIFs (it's clearly a needed feature, but it just doesn't come up that often), and sometimes you can do a port to an external program, or a c-node instead, so it doesn't feel like too much of a heel. And I say that as someone who contributed to a NIF that's still running at my last job. (And yes, I managed to take down the BEAM at least a few times with it; the most exciting of which was using cp instead of install to update the .so; not the right way to hotload)

> container orchestration contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Erlang

Is this Greenspun’s Eleventh Rule or something?

It's Virding's Law - kind of: "Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang."

Since in Erlang you should fail fast and noisily I think this Achilles heel is perhaps considered a strength.

A crash in a NIF circumvents the entire supervisor architecture that makes failing fast work well in Erlang/Elixir. The idea is that a crash never pulls down the entire thing, only the parts that are directly connected. So in the original telecommunication use case, you'd happily fail a single call if something unexpected happens, but that never should take down the entire telecom switch and all other calls with it.

A bad NIF crashes the entire VM and breaks the Erlang/Elixir model of failure handling, so you really should not put code that crashes in there.

Thanks for pointing out the error in my thinking!

How does Elixir handle OOMs, in such a case?

Do they fail the entire VM, too?

Yes mostly; when you get a failed allocation, for the most part there is nothing reasonable to do other than make a nice crash dump. Especially from the point of view of a language/toolkit.

As an operator/user, you can certainly write some code to scan for large processes and kill them, with a fair bit of success, but that's not a great thing for the toolkit to do; it doesn't have a good way to tell what's too big or what's intended.

Nowadays in Erlang you can set a memory limit per process[1], ensuring they get killed if they try to allocate more. Of course you still have to estimate how much is reasonable which isn’t always obvious...

[1] See max_heap_size under http://erlang.org/documentation/doc-9.0/erts-9.0/doc/html/er...


Actually, that's not exactly what Joe Armstrong envisioned when he created Erlang, the system.

If you read his 2003[1] thesis, at page 36 he quotes a paper from Jim Gray (who worked at Tandem computers and was a Turing award winner)

    Although compiler checking and exception handling provided by programming languages are real assets, history seems to have favored the run-time checks plus the process approach to fault-containment. It has the virtue of simplicity—if a process or its processor misbehaves, stop it
and then on page 37

    The idea of “fail-fast” modules is mirrored in our guidelines for programming where we say that processes should only do when they are supposed to do according to the specification, otherwise they should crash.
and then on page 40

    Error handling is non-local.

    When we make a fault-tolerant system we need at least two physically separated computers. Using a single computer will not work, if it crashes, all is lost. The simplest fault-tolerant system we can imagine has exactly two computers, if one computer crashes, then the other computer should take over what the first computer was doing. In this simple situation even the software for fault-recovery must be non-local; the error occurs on the first machine, but is corrected by software running on the second machine.
So actually the VM crash is a signal: something's gone really bad, don't even try to recover, just let it die and let some other non-local process take over (the OS, some `forever` like script, a process on another machine...).

[1] http://erlang.org/download/armstrong_thesis_2003.pdf

You can't put long quotes in code formatting

I'd agree if it crashed the calling process, but crashing the entire VM is way too noisy of a way to go out.

Do you know other VMs that resist a crash in external, native, unchecked code?

A NIF is basically VM code that undergoes some minor limitations and runs at full speed.

BEAM is not a sandbox and will (probably) never be.

The most important lesson Erlang teaches is that the code that crashed cannot be the code that recovers from the crash.

That's why in OTP Supervisors are in the shape of a tree, each level is responsible for its children and only its children.

If you need control on some process tree, you start another supervisor to supervise it and so on.

The same goes for the VM: if you need to supervise the VM, you start an external process to check its status.

A crashed process is better than a process that misbehaves.

Non locality of error handling is the only way to be really fault tolerant.

Is there a VM that does not crash in case of a catastrophic event?

For example: what happens if I run this code?

    public class Main {
      public static void main(String[] args) {
        Object[] o = null;

        // Keep wrapping the previous array so nothing can be collected.
        while (true) {
          o = new Object[] {o};
        }
      }
    }

What happens if there is a bug in some JNI code that causes a segfault?

What Erlang programming style tries to teach is not to avoid crashes by being very defensive and trying to prevent every possible error condition (it's impossible), but to detect them by being fault tolerant and restart the process from a good state.
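The "detect and restart from a good state" style can be imitated in miniature with Rust threads: a supervisor runs each attempt in a fresh thread and starts a new one when the worker panics. This is only a sketch with hypothetical names (`supervise`, `flaky_worker`, `doomed_worker`), not how OTP supervisors are implemented. Note that it catches panics only; a segfault in native code would still kill the whole OS process, which is the same boundary discussed in this thread for NIFs.

```rust
use std::thread;

/// Runs `job` in a fresh thread per attempt and restarts it after a panic,
/// up to `max_restarts` times. On success returns the value and the number
/// of restarts that were needed.
fn supervise(max_restarts: usize, job: fn(usize) -> u32) -> Result<(u32, usize), String> {
    for attempt in 0..=max_restarts {
        // Every attempt starts from a clean state in its own thread.
        let handle = thread::spawn(move || job(attempt));
        match handle.join() {
            Ok(value) => return Ok((value, attempt)),
            // The worker panicked; the supervisor itself is unaffected.
            Err(_) => continue,
        }
    }
    Err(format!("gave up after {} restarts", max_restarts))
}

/// Example worker that crashes on its first two attempts.
fn flaky_worker(attempt: usize) -> u32 {
    if attempt < 2 {
        panic!("simulated crash on attempt {}", attempt);
    }
    42
}

/// Example worker that never succeeds.
fn doomed_worker(_attempt: usize) -> u32 {
    panic!("simulated permanent failure");
}
```

Calling `supervise(3, flaky_worker)` succeeds after two restarts, while `supervise(1, doomed_worker)` gives up and leaves the decision to whoever supervises the supervisor. The panic messages are still printed to stderr by the default panic hook.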

For an example of this library used in production: https://blog.discordapp.com/using-rust-to-scale-elixir-for-1...

Related: I've found that to run time-consuming number-crunching code from Elixir, the most convenient approach is to call a Python script via ports. It's super easy and can't crash your VM.

To convert from/to Erlang terms in your Python script, you can use a library like https://github.com/skosch/python-erlastic (a fork I maintain, the original project was abandoned).

    fn add<'a>(env: Env<'a>, args: &[Term<'a>]) -> Result<Term<'a>, Error> {

As someone not familiar with Rust, I'm curious what all the symbols there mean. What it's doing is of course obvious, just wondering about the details.

fn is analogous to function in javascript. It takes the form `fn fnname<generics, generics>(argname: argtype) -> ReturnType`

The "weird" 'a here is a generic lifetime. It basically says that the returned value (Term) must not outlive the input values (env and args).

A &[Term<'a>] is a slice of Term<'a>. You can think of a slice as if it was an array whose size is known at runtime.

A Result<Term<'a>, Error> is a type in the standard library used for error handling. It can either be an `Ok(Term)` or an `Err(Error)` depending on whether the function succeeded or not. (There is no special magic here, you can define your own type that can be "either" one or the other. They're called enums in Rust).
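To show there is "no special magic", here is a small sketch that defines a hand-rolled either-type next to the standard `Result`. The names `Outcome`, `safe_div`, `safe_div_custom` and `describe` are invented for this example.

```rust
/// A hand-rolled "either" type with the same shape as the standard `Result`.
#[derive(Debug, PartialEq)]
enum Outcome<T, E> {
    Success(T),
    Failure(E),
}

/// Division that reports failure through the standard `Result` type.
fn safe_div(a: i64, b: i64) -> Result<i64, String> {
    // `checked_div` yields `None` for division by zero and for overflow.
    a.checked_div(b)
        .ok_or_else(|| "division by zero or overflow".to_string())
}

/// The same operation expressed with the custom enum instead.
fn safe_div_custom(a: i64, b: i64) -> Outcome<i64, String> {
    match safe_div(a, b) {
        Ok(value) => Outcome::Success(value),
        Err(message) => Outcome::Failure(message),
    }
}

/// Callers must handle both variants; the compiler rejects a non-exhaustive match.
fn describe(result: Result<i64, String>) -> String {
    match result {
        Ok(value) => format!("ok: {}", value),
        Err(message) => format!("error: {}", message),
    }
}
```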

Just a quick re-contextualization. I wouldn't say

> "It basically says that the returned value (Term) must not outlive the input values (env and args)."

It doesn't say that it must not, it says that it doesn't. Because lifetime information can only be inferred about callees and not callers, the lifetime annotation tells the compiler what it needs to know about the caller's environment. It tells the compiler "the lifetimes of these two things are tied together in my caller". The compiler can then use that information when the borrow checker is doing its magic.

I've got an article written in my head about how to think about lifetimes in terms of visibility up and down the call stack.

The same `<'a>` name on all of the inputs and outputs tell the borrow checker that these things are related, i.e. they all borrow from the same thing.

The borrow checker tracks these annotations through all the scopes and function calls back to the "thing" they borrow from, and checks that nothing is borrowed for too long.

I think this particular case is trivial enough that the annotation could be omitted:

     fn add(env: Env, args: &[Term]) -> Result<Term, Error> {

but in more complex cases where data from some arguments is returned, and from some it is not, it's useful to annotate which inputs/outputs are related and which aren't, so that the unrelated arguments can be temporary values without tainting everything else as temporary.
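A small sketch of that last point, with made-up function names: the return value is tied to `haystack` only, so the `needle` argument can be a short-lived temporary.

```rust
/// Returns a line borrowed from `haystack`. The result is tied to `'a` only,
/// so `needle` (lifetime `'b`) may be dropped as soon as the call returns.
fn first_line_containing<'a, 'b>(haystack: &'a str, needle: &'b str) -> Option<&'a str> {
    haystack.lines().find(|line| line.contains(needle))
}

/// Uses a temporary `String` as the needle and still returns a borrow of `log`.
fn find_error_line(log: &str) -> Option<&str> {
    // This String is dropped at the end of the function. That is fine because
    // the returned reference borrows from `log`, not from `needle`.
    let needle = String::from("error");
    first_line_containing(log, &needle)
}
```

If both parameters shared the single lifetime `'a`, the borrow checker would have to assume the result might point into `needle`, and `find_error_line` would be rejected for returning a reference to a local value.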

The 'a stuff is marking the lifetimes. In this case saying that the return value in the happy case needs the inputs to last at least as long as the return value does before being freed because there might be a dependency between them.

`'a` is a "lifetime parameter"; syntactically, it's like a generic parameter, only it's used to denote how long a given reference stays alive. This is one of the important mechanisms that Rust uses to ensure memory safety without a GC. Every reference in Rust has a lifetime, but many of them can be elided.

`fn add<'a>` just means "define a function with a lifetime parameter called `'a`". This is analogous to defining a function as `fn add<T>` with a generic parameter `T`.

`Env<'a>` means that the type `Env` is parametrized by a lifetime. As before, this is similar to a type being parametrized by a generic parameter. This is usually done when a type contains a reference; Rust requires that references stored in a type must have explicit lifetimes specified. In this case, we're specifying that the lifetime that Env is parametrized over is the lifetime parameter that the function defines. This doesn't mean anything in particular on its own, but when we use `'a` elsewhere in the signature, it means that the two places use the same lifetime, which is necessary to specify sometimes to clarify to the compiler that something is safe.

`&[Term<'a>]` is a slice of `Term<'a>` values; any type can go between the `[]`, e.g. `&[i32]` for a slice of 32-bit integers. A slice is a view into a sequence of values in memory. One of the things Rust does to aid the programmer in low-level optimizations is define separate types for sequences, namely slices, arrays, and vectors. A slice is a reference to another sequence type (which could be another slice); they're very cheap to make, but cannot be resized. An array is a value type (i.e. not a reference) and must be constant-sized. Finally, a `Vec` is resizable and heap-allocated, so it's expensive to create and copy (relative to slices).
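A quick sketch of the three sequence types side by side (the function names `total` and `sequence_demo` are just for illustration):

```rust
/// Accepts any contiguous sequence by borrowing it as a slice.
fn total(values: &[i32]) -> i32 {
    values.iter().sum()
}

/// Builds one of each sequence type and sums them through the same function.
/// Returns (array sum, vector sum, sub-slice sum, vector length).
fn sequence_demo() -> (i32, i32, i32, usize) {
    // Array: the length is part of the type and the data is stored inline.
    let array: [i32; 4] = [1, 2, 3, 4];

    // Vec: heap-allocated and growable.
    let mut vector: Vec<i32> = vec![10, 20];
    vector.push(30);

    // Slice: a borrowed view of elements 1 and 2 of the array; nothing is copied.
    let middle: &[i32] = &array[1..3];

    // `&array` and `&vector` both coerce to `&[i32]` at the call site.
    (total(&array), total(&vector), total(middle), vector.len())
}
```

This is why a NIF signature taking `&[Term<'a>]` does not care whether the caller stored the arguments in an array or a `Vec`.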

Finally, `Result<Term<'a>, Error>` means that the function returns a `Result` (i.e. either a success or an error); if no error occurs, it will return a `Term<'a>`; otherwise, it will return the type `Error` (which is generally custom-defined in a given package, but could be an instance of a standard library error like `std::io::Error`). This is roughly analogous to returning a `Term<'a>` and marking that an exception of type `Error` may be thrown.

One thing to note is that Rust's return values with lifetimes tend to be tied to one of the parameters. In this case, the compiler's rules aren't sufficient to infer which of `Env` and `Term` to tie it to, so the programmer needs to disambiguate. The programmer chose to tie them all together, which presumably indicates that the output will reference both parameters in some way (although it's impossible to figure out more precisely how it will from the signature alone).

Lifetimes are definitely the hardest part of Rust in my opinion; the important thing to realize is that changing the lifetime parameters in a function will never change the user-visible semantics of the code. If the lifetimes are invalid, the compiler will error, and if they are valid but not optimal, then the worst case is that that memory might be kept around longer than needed. Unless you're programming something that is super performance critical or will be long running, you generally won't have huge issues if you fix the lifetime compiler errors with trial and error when they come up (which is much less often than you might expect due to the compiler being able to infer them most of the time).

> the important thing to realize is that changing the lifetime parameters in a function will never change the user-visible semantics of the code

To reiterate: the lifetime parameters are used only to check the code for errors, they are discarded after that and don't affect the code generation. There's even an alternative compiler for Rust (mrustc) which completely ignores them and the rest of the borrow checking stuff, assuming the code is correct.

Does anyone know of an example of a rust library or function that’s been turned into a NIF? Ideally available on GitHub both before and after the conversion?

Also, the “nerves” framework is basically 80% NIFS of embedded C code.

Nerves overall has a much smaller portion of C NIFs than 80%, probably < 10%. They're mainly for properly trapping and cleaning up subprocesses. There are some NIFs too, to set up Linux-specific sockets/ioctls. Still surprises me that almost every Linux kernel interface essentially overloads `ioctl`.

Not sure of an example. Really just wrap any rust function with appropriate code to convert Erlang terms to rust and back... There’s a few elixir rustler howtos: https://medium.com/@jacob.lerche/writing-rust-nifs-for-your-...

You could check the list of packages depending on Rustler


As a professional Erlang developer I don't trust libraries that are heavy on NIFs, and Rust isn't going to change this. A key feature of the BEAM VM is process preemption. It makes NIF development harder than it looks, and typically the performance of NIF-based libraries looks best in synthetic benchmarks.
