

Show HN: Open Source emulator in Javascript that runs Linux - reginaldo
http://www.ubercomp.com/jslm32/src/

======
reginaldo
Since there seems to be some interest in the subject, and I don't want to
hijack the great thread about the worst Linux PC ever [1], I decided to submit
this as a separate thread.

Some things to keep in mind:

1\. Very early stage. Runs better on Chrome.

2\. If you think this sucks, you're probably comparing my code to Fabrice
Bellard's Jslinux[2], and as Bellard is a genius his code (which is
unfortunately still not available in readable form) is probably much better
than mine. But progress is being made.

3\. Still, the potential is there, soon it will be possible to have a somewhat
standard programming stack and write GUI applications that run off a canvas,
as I said in [3].

4\. For Chrome there is a nice terminal emulator (hterm) that I lifted from
the Chromium source tree. The other browsers get a plain _textarea_.

5\. The code, of course: <https://github.com/ubercomp/jslm32/>

If anyone has questions, I'll be happy to answer them.

[1] <http://news.ycombinator.com/item?id=3767410> [2]
<http://bellard.org/jslinux> [3] <http://news.ycombinator.com/item?id=3769486>

EDIT: Added items 4 and 5.

------
jbert
Very nice. Is there any chance of a follow up on your comment in the previous
thread:

"I'll write a post describing what I did to take boot time from 2.5 minutes to
10 seconds."

My interests:

\- how did you find what to optimise?

\- what approach did you use?

~~~
reginaldo
> \- how did you find what to optimise? - what approach did you use?

I used profilers: the Chrome developer tools and Firebug. The Chrome profiler
is very nice and easy to use, but it looks like it is a statistical (sampling)
profiler. When the typical invocation takes less than 1ms, it might get the
times wrong (there might be a bottleneck that the Chrome profiler is not
showing).

The Firebug profiler is more precise; I believe it traces every executed
instruction, which also makes your program much slower to run under it, but it
does sometimes give you interesting insights.

Also, there's intuition. I wanted to keep the code readable. The interpreter
loop is basically something to the effect of

    
    
      while(true) {
        op = decode(mmu.read_32(cs.pc));
        cs.pc += 4;
        (optable[op.opcode])(op);
      }
    

which I believe is very readable, but it runs counter to everything one
expects about writing fast JavaScript code in a tight loop. Bellard told me he
thinks my "dynamic interpreted" version might be slower than a thoroughly
optimized interpreter version, and he may very well be right. But I found the
readability of the interpreter to be of the utmost importance, at least until
I had a working "dynamic recompilation" version.

To optimize the interpreter version, the profiler showed me that basically
two things take most (about 65-70%) of the time:

  * Instruction decoding

  * Memory access

There's not much I found to make instruction decoding faster, but the memory
access time could be improved. In the beginning, I used the form
mmu.read_32(addr) to read from any region of the memory, including RAM. Most
of the accesses are of course to RAM, so it made sense to optimize for that
fact (despite it being bad for readability).

So nowadays, for memory accesses, I do:

    
    
      op = (v8[rpc] << 24) | (v8[rpc + 1] << 16) | (v8[rpc + 2] << 8) | (v8[rpc + 3]);
    
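A runnable sketch of that fast path (the names here are illustrative, not
necessarily the identifiers in the repo):

```javascript
// Sketch of the fast RAM read path: RAM is a plain byte array and
// 32-bit big-endian words are composed with shifts. The `>>> 0`
// forces the result into the unsigned 32-bit range, since `<< 24`
// can produce a negative signed value.
function make_ram(size) {
    return new Uint8Array(size);
}

function read_32_be(v8, addr) {
    return ((v8[addr] << 24) |
            (v8[addr + 1] << 16) |
            (v8[addr + 2] << 8) |
            v8[addr + 3]) >>> 0;
}

var ram = make_ram(16);
ram[0] = 0xde; ram[1] = 0xad; ram[2] = 0xbe; ram[3] = 0xef;
// read_32_be(ram, 0) → 0xdeadbeef
```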

Some other things I found randomly. The processor in the board I'm emulating
has a 75MHz clock. With that emulated frequency, Linux was spending a lot of
time on the "idle task". When I told the emulator that the frequency was about
8MHz, the problem went away.

Then, I thought to myself: it would be neat if I could implement parts of the
libc in Javascript. That's how
[https://github.com/ubercomp/jslm32/blob/master/src/lm32_libc...](https://github.com/ubercomp/jslm32/blob/master/src/lm32_libc_dev.js)
came to be. So I implemented some string.h functions in JavaScript (memcpy,
memmove, and memset), and voilà, the boot time went down by about 3x.
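For illustration, assuming RAM is exposed as a Uint8Array (the real
lm32_libc_dev device wiring is more involved than this), the core of such
helpers is tiny on the JS side:

```javascript
// Sketch of JS-side string.h helpers over an emulated RAM byte array.
// All names and the flat-array interface are assumptions for the sake
// of the example.
function js_memset(ram, addr, value, len) {
    ram.fill(value & 0xff, addr, addr + len);
}

function js_memcpy(ram, dst, src, len) {
    // copyWithin handles overlapping regions like memmove does,
    // so the same helper is safe for both memcpy and memmove.
    ram.copyWithin(dst, src, src + len);
}
```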

But at the end of the day, lm32_libc_dev is a hack with very negative
consequences: I would have to modify every program I wanted to run on the
emulated board to use my libc functions instead. I did this with the Linux
kernel and it was a bad experience. So I focused my energy on the dynamic code
generation side of things.

I can tell you I have succeeded: the jslibc thing is now deleted from my local
source tree and I didn't experience any slowdowns whatsoever.

~~~
jbert
Thanks for this, it's a great piece of work and very easy to read. Nice code.

> "it looks like it is a statistical profiler. When the usual invocation takes
> less than 1ms, it might get the times wrong"

I don't understand this. As long as the statistical profiler can interrupt at
any point, surely a routine run 10k times at 0.1ms each will still be
interrupted a similar number of times as one routine running for 1s? Or are
you thinking that there may be a bias to the sampling (e.g. the method of
interrupting means we're unlikely to see the first N ops after a method call,
or something)?

Did you see a significant difference in the profile between the two profiler
approaches (firebug/chrome)?

> For memory accesses.

I wonder if your RAM accesses are mostly 32-bit aligned and made 4 bytes at a
time? (I guess you could instrument to find out?) If so, you might get a
further speedup by having the RAM array stored as 32-bit quantities instead of
bytes, doing:

    
    
      op = v32bit[rpc]; // only works for aligned code
    
and then having the RAM read/write 8/16 be special cases on top of that? That
could also give a speedup on memmove etc (4x speedup?), making the libc hack
less necessary?
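Something like this, as a sketch (all names made up):

```javascript
// Sketch of the 32-bit-view idea: one ArrayBuffer with a byte view
// for the 8/16-bit special cases and a word view for aligned 32-bit
// accesses. Caveat: the word view reads in *host* byte order.
var buf = new ArrayBuffer(1024);
var v8 = new Uint8Array(buf);   // slow path: 8/16-bit accesses
var v32 = new Uint32Array(buf); // fast path: aligned 32-bit words

function read_32_fast(addr) {
    // only valid when (addr & 3) === 0
    return v32[addr >> 2];
}

function write_32_fast(addr, val) {
    v32[addr >> 2] = val;
}
```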

~~~
reginaldo
> I don't understand this.

Well, I won't speculate on the workings of the Chrome profiler, but what I can
say is that profiling in Chrome and Firefox gives different results (which may
of course be caused by differences in the Javascript engines).

> I wonder if your RAM accesses are mostly 32bit

All accesses on the LM32 processor are aligned. Half-word accesses have 2-byte
alignment, and word accesses have 4-byte alignment. So in theory it would be
easy.

I use ArrayBuffers for the RAM array (where supported). You can have different
views over the same ArrayBuffer, so in theory I could have the RAM ArrayBuffer
with an 8-bit view, a 16-bit view, and a 32-bit view, each used for accesses
of its size. In that sense, you're absolutely right, and I think a good
speedup would be achieved by that change (something like 2x, as most accesses
are in fact 32-bit).

I don't do that for endianness reasons. The LM32 is big-endian, and most
devices nowadays are little-endian. Unfortunately, the state of ArrayBuffer is
still kind of chaotic, in that the endianness of a typed-array view is the
endianness of the host (they do this for performance reasons). So I can't do
the v32[rpc] thing. Theoretically, I could if the programs had read-write
consistency, i.e., if they only ever used the same access size for reading and
writing a given piece of data, which they don't. A program that tests the
endianness of the processor, for instance, does not have read-write
consistency.

I could use ArrayBufferViews to access memory in an endianness-independent
way, but I tried it and it's actually much slower.
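A sketch of that DataView approach (illustrative code, not the emulator's):
the boolean argument to getUint32/setUint32 selects the byte order explicitly,
so the result is the same on any host, just slower than raw typed-array
indexing.

```javascript
// Endianness-independent memory access via DataView. The second
// argument selects byte order: false = big-endian (the LM32's order),
// true = little-endian. Correct on any host, but slower than
// indexing a Uint32Array directly.
var buf = new ArrayBuffer(16);
var dv = new DataView(buf);

function read_32_be(addr) {
    return dv.getUint32(addr, false); // false = big-endian
}

function write_32_be(addr, val) {
    dv.setUint32(addr, val, false);
}
```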

To make memory access faster, one of two things can be done: 1) emulate a
little-endian processor instead, or 2) have a fast byteswap operator in
JavaScript, without a function call (never gonna happen).

So since 2 is never going to happen, the solution is to tackle 1.
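For concreteness, a 32-bit byteswap in plain JavaScript looks like this: cheap
as an inline expression, but as a function it adds a call to every single
memory access.

```javascript
// 32-bit byteswap: four shifts, two masks, and an unsigned coercion.
function bswap32(x) {
    return (((x >>> 24) & 0xff) |
            ((x >>> 8) & 0xff00) |
            ((x << 8) & 0xff0000) |
            (x << 24)) >>> 0;
}
// bswap32(0x12345678) → 0x78563412
```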

~~~
jbert
Ugly thought - if "almost all" accesses are Read-Write consistent, maybe an
exception list of addresses could be used?

Basically do the 4byte cell approach, and make most of your RAM the "fast"
endianness, with some 4byte cells "correct" endianness, because you know they
are being accessed in a way that cares. (Since you're interpreting, I'm
guessing this is reasonably-easily detectable?)

It does put a "if-addr-on-exception-list" on your mem access path, but if
that's a hash lookup it could be OK. If you can further guarantee that your
text segments are always fast-endian, then your opcode dispatch loop can still
go direct and miss out that test.
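A rough sketch of that access path (everything here hypothetical): most words
live in host order behind the fast 32-bit view, and only addresses known to be
accessed sub-word go through an explicit big-endian path.

```javascript
// Hypothetical exception-list memory access: words are stored in host
// byte order and read through the fast 32-bit view, unless the address
// is on the exception list, in which case the cell is kept in guest
// (big-endian) order and read through the slower explicit-order path.
var buf = new ArrayBuffer(1024);
var v8 = new Uint8Array(buf);
var v32 = new Uint32Array(buf);
var dv = new DataView(buf);
var on_exception_list = {}; // addr -> true for RW-inconsistent cells

function read_32(addr) {
    return on_exception_list[addr] ? dv.getUint32(addr, false)
                                   : v32[addr >> 2];
}
```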

~~~
reginaldo
That would actually work, I guess :). I'm pretty sure the memcpy, memmove,
etc. implementations I've seen for this architecture all work by copying one
byte at a time (argh!), but that problem can be solved with some smart-if-ugly
hackery.

But, as you said yourself, it would be kind of messy, in that it would be hard
to be certain it works in 100% of the cases.

I think it would be easier to just change the toolchain code and make it think
the processor is little endian.

------
rdg
Hey reginaldo, for some reason I cannot enter any text on the terminal, hence,
I cannot login. I'm using Chrome 18 on OS X Lion. Just in case, I tried
downloading the code and running it locally: same result. Any ideas?

~~~
reginaldo
Yeah, it's a known bug. The hterm terminal emulator runs in its own iframe, so
you'll have to click somewhere on the page (outside the terminal) first. If
that doesn't work, then it's a new bug...

------
int3
You mentioned in the 'Worst Linux PC' thread that your implementation
generates JS code on the fly and 'captures up to three backwards jumps'. Do
you mind explaining in more detail how this works?

~~~
reginaldo
Ok, one way to do interpretation is to do a loop like

    
    
      while True:
        instr = decode_instruction(pc)
        execute_instruction(instr)
    
    

where execute_instruction is usually a big switch with every opcode in it or,
in my case, a call to a function in a table whose indices are opcodes.

The problem is that you decode the same instructions many times (in a loop,
for instance). A slightly more elaborate version looks like this:

    
    
      while True:
        if not pc in cached_code:
          block = generate_block(pc)
          cached_code[pc] = block
        run_from_cache(pc)
    
    

The generate_block function decodes and generates code for every instruction
until it finds a jump. When it finds it, it finishes up that block, which is
then added to the cache.

Now, whenever there is a jump to an address that is in the cache, the cached
code can be executed directly.
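In JavaScript, that dispatch loop looks roughly like the sketch below. The
generate_block here is a toy stand-in that just counts block entries, purely
to keep the sketch self-contained; the real one emits a function per basic
block of guest code.

```javascript
// Sketch of the cached-dispatch loop (names illustrative): each block
// is compiled once per entry pc, then re-run straight from the cache.
var cached_code = {};

// Toy code generator: the "block" increments a counter, advances the
// pc by one instruction, and halts at cs.end. A real generate_block
// would decode guest instructions up to the next jump.
function generate_block(pc) {
    return function (cs) {
        cs.counter += 1;
        cs.pc = pc + 4;
        if (cs.pc >= cs.end) {
            cs.running = false;
        }
    };
}

function run(cs) {
    while (cs.running) {
        var block = cached_code[cs.pc];
        if (block === undefined) {
            block = generate_block(cs.pc);
            cached_code[cs.pc] = block;
        }
        block(cs); // updates cs.pc to the next block entry
    }
}

var cs = { pc: 0, end: 16, counter: 0, running: true };
run(cs);
// cs.counter → 4, one cached function per block entry (0, 4, 8, 12)
```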

Finally, the "captures up to three backwards jumps" thing:

My first try was to have the cached code just be a regular function call
without loops. So a block of code is executed and then control is given back
to the main loop.

But a significant number of jumps have targets that are inside the same block
of code (anything you run on a loop, for instance). So, to get better
performance, I generate, for my block of code, a function that has a loop in
it.

It has a rather characteristic look:

    
    
      function code_block(cs) { // cs is the "cpu state"
          var count = 0;
          cs.next_pc = BLOCK_START;
          block_loop: while (++count < 3) {
              block_choice: switch (cs.next_pc) {
                  case BLOCK_START:
                      cs.r0 = cs.r0 ^ cs.r0;
                  case BLOCK_START + 4: // regular instructions fall through
                  case CONDITIONAL_BRANCH_ADDR:
                      if (branch_condition) {
                          cs.next_pc = BRANCH_TARGET;
                          break block_choice; // this captures backwards jumps targeting the block
                      }
                  case UNCONDITIONAL_BRANCH_ADDR:
                      cs.next_pc = BRANCH_TARGET;
                  default:
                      break block_loop;
              }
          }
      }
    

The "captures up to three backwards jumps" thing is controlled by the ++count
< 3 loop condition. 3 was obtained by trial and error and looks like a good
number for my machine (I could, for instance, put 100 in there, but then my
JavaScript functions would take too much time to return, and the JavaScript
engines don't like that).

~~~
int3
Thanks for the explanation! It makes sense. I was just wondering though -- why
is it necessary for generate_block / decode_instruction to be called on the
fly, instead of as a pre-compilation step? I guess you need some sort of
runtime information for the compilation?

~~~
reginaldo
What you're looking for is called static recompilation. It's what Emscripten
[1] does: it statically recompiles LLVM bitcode to JavaScript. Emscripten is a
wonderful project, by the way. After compiling from LLVM to JavaScript, it
applies an algorithm (the relooper) to the "intermediate" JavaScript code,
producing code that resembles the structure of the original, so that, for
instance, if the original code has a loop, the final JavaScript code will also
have a loop. It is a true gem, and it's part of what makes Emscripten so
awesome and fast.

I believe I could do static recompilation for programs that run in a barebones
environment and do not employ self-modifying code. But the more general
problem of finding all executable code in a blob of arbitrary binary data
(which I would have to solve if I wanted to run arbitrary programs unmodified,
which I do) is equivalent to the halting problem, so that is a no-go. This
means that, to have a general solution, a hybrid approach of static and
dynamic recompilation would have to be employed.

For well-behaved programs, code generation on the fly does not pose a
significant overhead (that's what the profiler tells me, if I recall
correctly), so I'm not sure the hybrid approach is worth its weight in
complexity.

Besides, I don't know if static recompilation could run a full OS, because an
OS needs interrupts (in my case, currently there's a timer tick and the UART
interrupt when input is entered through the keyboard), and I haven't thought
about handling that.

[1] <https://github.com/kripken/emscripten>

~~~
int3
Yup, I'm aware of Emscripten, which was part of the reason behind my question.
I hadn't considered the issue of lack of code / data separation when dealing
with pure binary code.

I'm actually working on a JVM interpreter in Coffeescript now, so this topic
is particularly interesting to me. Thanks again for taking the time to explain
things.


