
Show HN: Rembulan, an implementation of Lua 5.3 for the JVM - sandius
https://github.com/mjanicek/rembulan
======
marktangotango
This is an impressive project. There were a handful of Lua 5.1 VMs implemented
in Java, IIRC; it looks like this project compiles Lua 5.3 to either Java or just
bytecode? Also impressive that there's a Lua parser as well, as some of the
old Java Lua impls required the Lua compiler to generate bytecode. I'm also
impressed you've implemented CPU accounting. The other things you need for 100%
sandboxing are heap limits and stack limits; is there any way to do that in
this impl? Finally, what is the intended use for this project, i.e. why did you
write it?

~~~
sandius
Thanks!

Yes, indeed: there is LuaJ and Kahlua, but neither is able to do Lua 5.3 (they
both support 5.2). I actually started by trying to update LuaJ to 5.3, but in
the end I found myself disagreeing with some of the fundamental design choices
(such as the mapping of Lua coroutines to Java threads), and decided to start
from scratch. Lua is a small and well-designed programming language, so it
seemed doable. In the process I made some controversial design choices myself
:)

Rembulan compiles Lua 5.3 sources to Java bytecode directly -- it's easier
(and probably faster) to do it that way, rather than compiling to Java and
then using javac to get the bytecode. For instance, there is no "goto" in the
Java programming language, but the Java bytecode does have unconditional
jumps. Besides, Rembulan needs to be able to suspend a Lua call at almost any
point, and restore it later. So even if I went via generating Java sources
first, those sources would be very human-unfriendly (not readable or useful).

Writing a Lua parser + compiler is actually quite straightforward. Everything
you need -- both syntax and semantics -- is described in the Lua Reference
Manual. I used to have a PUC-Lua bytecode loader and compiler (i.e., a
recompiler that would read PUC-Lua bytecode and emit Java bytecode), but it
was for bootstrapping only -- once I'd verified that my compiler covered the
entire language, I got rid of it. That said, I think neither LuaJ nor Kahlua
actually requires PUC-Lua. LuaJ uses a parser/compiler that looks like it was
ported from C (i.e., it looks suspiciously similar to the PUC-Lua sources).
But it generates PUC-Lua bytecode, not Java bytecode directly; that's a
separate (optional) step.

Regarding sandboxing: Heap limits are tricky, since the JVM doesn't give you
direct control over the GC, and as far as I'm aware it isn't possible to
"segment" the JVM's heap into smaller chunks. So it seems that the only way to
do this is to do some bookkeeping, and keep track of new tables (and new table
entries), coroutines and userdata, updating the "heap counter" once they are
GC'd. (Plus of course throwing an exception once the limit has been exceeded.)
I haven't worked on that yet, but the infrastructure should be more or less
ready for this kind of approach.
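
The bookkeeping idea could be sketched roughly as follows (all names here are hypothetical, not Rembulan's actual API): charge a shared counter when a sandbox object (table, coroutine, userdata) is created, and credit it back once the object has actually been collected, using `java.lang.ref.Cleaner` as the GC notification mechanism.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of heap accounting for a sandbox: the counter is an
// upper bound on usage, since credits only happen after the GC runs.
public class HeapAccounting {
    private final Cleaner cleaner = Cleaner.create();
    private final AtomicLong used = new AtomicLong();
    private final long limit;

    public HeapAccounting(long limit) {
        this.limit = limit;
    }

    // Charge `bytes` for a newly allocated sandbox object; the cleaning
    // action credits them back once the object is unreachable and cleaned.
    public void register(Object obj, long bytes) {
        long total = used.addAndGet(bytes);
        cleaner.register(obj, () -> used.addAndGet(-bytes));
        if (total > limit) {
            throw new IllegalStateException("memory limit exceeded");
        }
    }

    public long used() {
        return used.get();
    }

    public static void main(String[] args) {
        HeapAccounting acct = new HeapAccounting(1000);
        acct.register(new Object(), 600);
        System.out.println("used = " + acct.used());  // prints "used = 600"
        try {
            acct.register(new Object(), 600);  // would exceed the 1000 limit
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note the caveat discussed below: the credit side only runs after the GC has actually reclaimed the object, so the counter can overestimate (but never underestimate) real usage.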

Stack limits are easier: increment a counter on each (non-tail) call, decrement
it on return, and throw an error if the limit is exceeded. I haven't
implemented that yet, though.
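
In code, that scheme is just a guarded counter -- a minimal sketch (my naming, not Rembulan's):

```java
// Hypothetical stack-depth limiter: enter() on every non-tail call,
// exit() on every return; exceeding the limit raises an error analogous
// to Lua's "stack overflow".
public class CallStackLimiter {
    private final int limit;
    private int depth = 0;

    public CallStackLimiter(int limit) {
        this.limit = limit;
    }

    public void enter() {
        if (++depth > limit) {
            depth--;  // keep the counter consistent for error handlers
            throw new IllegalStateException("stack overflow");
        }
    }

    public void exit() {
        depth--;
    }

    public static void main(String[] args) {
        CallStackLimiter limiter = new CallStackLimiter(3);
        limiter.enter();  // depth 1
        limiter.enter();  // depth 2
        limiter.enter();  // depth 3
        try {
            limiter.enter();  // depth 4 -> exceeds the limit
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```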

I think Rembulan might work quite well in a backend of a massive multiplayer
game, especially one that the players may program (to a degree) themselves.
Something along the lines of Grobots, but programmed in Lua instead of Forth
:)

~~~
marktangotango
Ah ok, this use case of executing untrusted code is really common. I'm really
surprised there isn't a "real" solution today. Like with your project here,
constraining the heap used by the JVM is impossible unless you implement some
sort of accounting system, as you said. Another option is to implement a
language that disallows dynamic memory allocation and requires all memory use
to be declared statically, e.g. like assembler or COBOL.

Do you see the heap counter method you mentioned as viable? You can track
allocation that way, but deallocation is more problematic because when
finalizers run is non-deterministic -- at least that's my understanding. To be
clear, the problem here would be that a user's script may have run out of
memory, but the finalizers just haven't run yet, so the heap counter is not
accurate.

I personally think there's a need here that's not being met by any current
language or runtime.

~~~
sandius
I think it would be a good start.

Notice that the value of the heap counter would at all times be an upper
bound on the actual memory usage. In other words, we may always assume that
there's unreclaimed garbage memory that we're still counting in. But that also
means that if we throw an error if and only if we detect that the limit has
been exceeded, the only kind of error we can make is a false positive. For
sandboxing scenarios, that's at least some good news after all: what we
definitely don't want are false negatives, and we won't get those.

Now, when we've detected that the limit has been exceeded, we don't have to
signal an error immediately. The execution can be suspended (as with CPU
accounting). We can resume the execution any time later, or terminate it with
an out-of-memory error. What we actually do depends on the application: we
could simply decide that getting false positives is a risk worth living with
and terminate the program; we could try calling System.gc() and check the
limit afterwards; we could increase the limit temporarily and check in X ticks
whether this was just a spike in allocations; and so on.
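
One such policy could look like this sketch (hypothetical names; the supplier stands in for whatever heap counter the runtime maintains): instead of failing immediately when the counter trips, request a GC, give cleanup a moment to run, and re-check the counter before terminating the script.

```java
import java.util.function.LongSupplier;

// Hypothetical recovery policy for a tripped heap counter: a false
// positive may clear itself once the GC reclaims counted-but-dead objects.
public class OverLimitPolicy {
    public static boolean tryRecover(LongSupplier usedBytes, long limit)
            throws InterruptedException {
        if (usedBytes.getAsLong() <= limit) {
            return true;          // nothing to do
        }
        System.gc();              // best-effort request; may be a no-op
        Thread.sleep(50);         // give cleanup/finalization a chance to run
        return usedBytes.getAsLong() <= limit;  // did the spike clear?
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a counter that reads 1500 before the GC and 800 after,
        // i.e. a false positive that clears on re-check.
        long[] reads = {1500, 800};
        int[] i = {0};
        boolean ok = tryRecover(() -> reads[i[0]++], 1000);
        System.out.println(ok ? "resumed" : "terminated");
    }
}
```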

I see the chances of getting to a 0% false-positive rate as next to nil (at
least on a generic JVM), but I tend to think that such a technique could
go a long way.

Regarding other solutions to this problem out there today: I agree! Using a
programming language that evades these problems entirely would be a solution,
but I'd be skeptical about the chances of persuading users that it's for their
benefit.

~~~
marktangotango
Great! I don't feel like many people are really thinking about this problem.
Just a note: I believe System.gc() is a no-op, and has been for a while?

~~~
sandius
I haven't seen that much written about this either. My only explanation is
that it isn't generally seen as a problem worth solving, for two definitions
of "worth": either it's too difficult (or even impossible), or a solution to
it isn't needed. I tend to disagree with both.

Btw, thanks for the discussion!

About System.gc(): at least in the Oracle JVM, I think so. But then look at
what it says in the JavaDoc: "When control returns from the method call, the
Java Virtual Machine has made a best effort to reclaim space from all
discarded objects." A no-op interpretation can indeed be a best effort (as in,
"there's nothing to be done!") :) But since it's the only place in the JDK
that gives at least some kind of a GC-related guarantee, it's the best there
is.

These pastures may be greener in other JVMs, though!

