Emulating an emulator inside itself. Meet Blink (hiro.codes)
196 points by 0xhiro on Jan 4, 2023 | 107 comments



Blink sounds cool, but this blog post is pretty thin. It's just restating a handful of tweets about Blink by its author.


Author of Blink here. Ask me anything :-)


What do you attribute the perf win over Qemu to? A bunch of micro-optimizations, fewer abstraction layers, or something more systemic?


Blink is like a Tesla sports car whereas Qemu is like a locomotive. I think what may be happening is that Qemu has a lot of heavy-hitting optimizations that benefit long-running compute-intensive programs. But if you just want to run a big program like GCC ephemerally as part of your build system, the cost of the locomotive gaining speed doesn't pay off, since there's nothing to amortize it over. Blink's JIT also accelerates quickly because it uses a printf-style DSL and it doesn't relocate. The tradeoff is that JIT path construction sometimes fails and needs to be retried.
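
A hypothetical sketch of what a printf-style JIT DSL can look like (illustrative only, not Blink's actual code): each directive appends bytes to an append-only executable buffer, either gluing in a compiler-built code fragment verbatim or embedding an immediate operand, so no relocation pass is ever needed.

    /* Hypothetical printf-style JIT DSL (not Blink's actual API).
       %c copies a prebuilt machine-code fragment verbatim;
       %i embeds a 64-bit immediate operand for a fragment to read. */
    #include <stdarg.h>
    #include <stdint.h>
    #include <string.h>

    struct Jit { uint8_t *p; };  /* append-only executable buffer */

    static void Emit(struct Jit *j, const char *fmt, ...) {
      va_list va;
      va_start(va, fmt);
      while (*fmt) {
        if (*fmt++ != '%' || !*fmt) continue;
        switch (*fmt++) {
          case 'c': {  /* glue in a function body built by the compiler */
            const uint8_t *frag = va_arg(va, const uint8_t *);
            size_t n = va_arg(va, size_t);
            memcpy(j->p, frag, n);
            j->p += n;
            break;
          }
          case 'i': {  /* splat a 64-bit constant into the stream */
            uint64_t imm = va_arg(va, uint64_t);
            memcpy(j->p, &imm, sizeof(imm));
            j->p += sizeof(imm);
            break;
          }
        }
      }
      va_end(va);
    }

Because paths are only ever appended, a failed construction can simply be abandoned and retried, which matches the tradeoff described above.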

Another great example of this tinier-is-better phenomenon is v8 vs. QuickJS. Fabrice Bellard singlehandedly wrote a JavaScript interpreter that runs the Test262 suite something like 20x faster than Google's flagship V8, because once again, tests are ephemeral. It's amazing how much quicker QuickJS is. But if you wanted to do something like write a JS MPEG decoder to show television advertisements without a <video> tag, then v8 is going to be faster, since it's a locomotive.

Fabrice Bellard wrote Qemu too. But I suspect his Tiny Code Generator has gotten a lot heftier over the years as so many people everywhere contributed to it. I really want to examine his original source code, since I'd imagine what he originally did probably looked a lot more like Blink than it looks like modern Qemu.


Hi Justine, QEMU developer here. Great job on Blink! You have done a lot of cool work and it's been fun to follow. I enjoyed looking at different choices you made in the frontend, for example flags handling is very different from QEMU.

QEMU's code generator is actually pretty fast and shouldn't really be expensive. It's a handful of passes that are run on individual basic blocks, certainly not optimal when a lot of code runs once as is the case for a very short compile but it's nothing like v8.

I suspect an even sillier reason: startup time might be the biggest factor, because I think qemu-user's startup has never been optimized. I assume both the QEMU and blink binaries are statically linked (or, alternatively, both dynamically linked)?

Anyhow these theories should be pretty easy to disprove just by compiling something larger than hello world, so I will do it in case there's some low-hanging fruit left.


Is there a way to run linux binaries on macOS and Windows with qemu-user? As far as I can tell it only runs on Linux (and I guess BSD). Performance considerations are nice, but I think the most interesting thing about Blink is the promise of running small linux binaries on these other platforms.


No. It would be easy to do on Darwin, but only if you were willing to trade off emulation accuracy, and QEMU wants to be able to run any Linux program.

Even if you remain in the Unix world (so macOS but not Windows), system-call-level translation will let some of the differences between operating systems leak through to the program; and it will cause failures if a program has #ifdefs to distinguish between Linux and Darwin but then, having been compiled for Linux, sees Darwin behavior.

On Windows, you would basically be reimplementing cygwin—imagine what a mess it would be for a Linux program to see Windows paths with drives and backslashes.

Even though blink can run any Linux program in principle, the main idea is to use it for the author's "cosmopolitan libc", to run programs that are essentially scripts compiled to machine code. So you'd write these programs portably but still be able to ship a single binary.


Fantastic, hope you find some things. This stuff is great to do and almost as much fun to read about.


> Fabrice Bellard singlehandedly wrote a JavaScript interpreter that runs the Test262 suite something like 20x faster than Google's flagship V8

Something is wrong here. How did you test this? QuickJS might start up faster on very small testcases but V8 is not that slow; it needs to have very low latency on a webpage too. Did you run a debug build or something?


I have no knowledge of what allows QuickJS to run the tests faster, or if it even does run the tests faster, but QuickJS does have one big speed advantage over V8 in some circumstances: QuickJS allows ahead-of-time compilation of JS to byte code. This removes the need to parse the JS at execution time. It’s a pretty nifty feature.
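
For reference, QuickJS exposes this through its C API, roughly like the following (sketched from memory of quickjs.h, so treat the exact signatures as assumptions; it's the same mechanism the bundled qjsc compiler uses):

    /* Compile once, serialize the bytecode, then run it later
       without re-parsing the source. Error handling omitted. */
    #include <stddef.h>
    #include <stdint.h>
    #include "quickjs.h"

    uint8_t *CompileToBytecode(JSContext *ctx, const char *src,
                               size_t len, size_t *out_size) {
      /* Parse and compile only; returns an unexecuted function object. */
      JSValue obj = JS_Eval(ctx, src, len, "<input>",
                            JS_EVAL_FLAG_COMPILE_ONLY);
      uint8_t *buf = JS_WriteObject(ctx, out_size, obj,
                                    JS_WRITE_OBJ_BYTECODE);
      JS_FreeValue(ctx, obj);
      return buf;
    }

    void RunBytecode(JSContext *ctx, const uint8_t *buf, size_t size) {
      JSValue obj = JS_ReadObject(ctx, buf, size, JS_READ_OBJ_BYTECODE);
      JS_FreeValue(ctx, JS_EvalFunction(ctx, obj));  /* executes it */
    }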


FWIW v8 does too through its "snapshot" feature. It's used by Chrome and Node.js to speed up the initialization of the standard library. It might even be faster because it includes both the bytecode and the heap.


Once upon a time, Firefox used to do that for extensions.


That's a tradeoff that might make sense for optimising single-shot execution environments - I'm thinking Serverless/Lambda.


Fascinating. Most JS code is ephemeral, i.e. rarely is something as intensive as video encoding done in the browser (and even then WebAssembly would usually be preferred).

It seems to me like browsers would benefit from running most code in QuickJS, and then spinning up V8 only for those rare cases of long-running JS?


"Ephemeral" is relative. Most JS code in the browser runs for at least 30 seconds, if not longer, as the user interacts with the page. That's plenty of time to spend spare cycles on JITing in the background to make responsiveness better without worrying about 100s of milliseconds of startup / shutdown latency.


V8 is optimized for real-world use cases, not benchmarks. Any modern browser will blow QuickJS out of the water for anything that's non-trivial.


Would it be fair to describe Blink's JIT as more of a "baseline JIT" to QEMU's "optimized JIT", or does that analogy not accurately capture what you mean in the first paragraph?


Both of them are pretty basic but Blink is closer to subroutine-threaded code than to a full JIT. IIRC it has no intermediate representation, liveness analysis or register allocation.

QEMU has all of those but only within a basic block, so it does not do any complicated data flow analysis. It's what would be considered a baseline JIT in the JavaScript world. Rosetta is similar too as far as I understand.


> Fabrice Bellard singlehandedly wrote a JavaScript interpreter that

No he didn't.


"Charlie Gordon was involved too," I suppose, would be a more constructive comment.


So we agree that he didn't write it singlehandedly, then.

jonah_hill_fuck_me_right?.jpg

(We don't agree on the appropriateness of the description that he "was involved", however. That sounds like an attempt to understate/diminish.)


QEMU TCG is not particularly optimized for performance. It's not all that hard to do better than it, especially if you target only one architecture.


It's also not that easy, though, and blink's code generation is comparable to QEMU's circa 2007. Even if TCG remains relatively basic, 2x better code generation than QEMU is still a strong claim.

I suspect that the reason why blink is faster in this experiment is not related to code generation, as I mentioned in another comment. Looking again at the screenshot, the 28 context switches (vs 0 for blink) might be a clue as well.


Qemu generates pretty good code. I've skimmed through your codebase. Qemu knows all the ISA-specific operation constraints, which lets it work very much like a compiler. All I'm saying is that Qemu generates that quality of code slowly. That's what I mean by a locomotive. A train can go hundreds of miles per hour faster than a Tesla sports car. But which is quicker off the mark? Blink basically just glues together functions that were created by the compiler. So when it runs a program like GCC, which has an enormously long initialization ritual, the JIT path generator is able to plow through it at a speed comparable to memcpy.

While I have your attention, may I ask if you've considered saving your JIT's output to a file that can be rapidly loaded if GCC is launched a second time? Ephemeral commands tend to be executed independently many times, which would seem to favor an AOT approach. But I don't see any reason why a JIT can't persist its output to gain the advantages of AOT. I'd call it EOT.
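
A minimal sketch of the idea, assuming the generated code is position-independent or mapped at a fixed address (hypothetical; this is not a feature of Blink or QEMU today):

    /* Hypothetical "EOT" cache: mmap back a previous run's JIT output,
       keyed on the guest binary's size and mtime (a content hash would
       be more robust). Returns NULL on a cold start. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    void *LoadJitCache(const char *guest_path, size_t *out_size) {
      struct stat st, cst;
      char path[4096];
      if (stat(guest_path, &st) == -1) return NULL;
      snprintf(path, sizeof(path), "/tmp/eot-%ld-%ld.bin",
               (long)st.st_size, (long)st.st_mtime);
      int fd = open(path, O_RDONLY);
      if (fd == -1) return NULL;  /* cold start: JIT normally, save after */
      fstat(fd, &cst);
      *out_size = cst.st_size;
      void *code = mmap(NULL, cst.st_size, PROT_READ | PROT_EXEC,
                        MAP_PRIVATE, fd, 0);
      close(fd);
      return code == MAP_FAILED ? NULL : code;
    }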


> may I ask if you've considered saving your JIT's output to a file, that can be rapidly loaded if GCC is launched a second time?

I don't work that much on QEMU TCG actually, but it would be possible to do so. In fact recent (unrelated!) changes to QEMU might even make it possible to preserve ASLR with such a first-execution JIT compilation.

That said I have rarely seen code generation in QEMU's profiles. In the end what matters is real world performance and I need to redo the test myself to understand what's going on and whether your benchmark is representative of e.g. building a small but nontrivial program (let's say blink itself) with both QEMU and blink. In that case there would be repeated cold-start recompilation, but also the compiled code would run at least once per function so QEMU would have an edge.

To be honest blink is probably more like a bicycle than a sports car. It will start faster than a locomotive, but the cruise speed is definitely lower.


> A train can go hundreds of miles per hour faster than a Tesla sports car. But which is quicker off the mark?

Top speed of most trains is below that of a Tesla, let alone supercars. The trains that can go significantly faster (though not really hundreds of miles per hour faster), i.e. maglevs, are capable of similar acceleration. This analogy is not the greatest for the point being conveyed.


I think the point here is that trains go for hundreds of miles faster than a Tesla. If you're going 800 miles, a train might well get there first, even though a Tesla's top speed is 4x the train's.


OP: “A train can go hundreds of miles per hour faster.” Anyway, I think this proves the limitations of this analogy.


I think this comes down to a difference in perspective on trains. Coming from a European perspective, when I first read the analogy I was imagining the incredibly fast passenger trains over here (~320 km/h).

I suspect if you are coming from the perspective of a US rail user then yes, they aren't exactly known for high speed train travel.


320 km/h is not hundreds of miles per hour faster than the top speed of a Tesla, especially one like the Plaid marketed for speed. I don’t give a fuck if you’re from Europe or the US, high speed trains aren’t faster than high speed sports cars. This is from the perspective of someone who’s not an idiot.


Last time I checked on blink, it couldn't run dynamic executables (or ELFs that run with the dynamic interpreter), and would result in a segfault. How has it improved since then?


It has improved! What happens now is it prints an error:

    $ o//blink/blink /bin/ls
    error: unsupported executable; we need:
    - flat executables (.bin files)
    - actually portable executables (MZqFpD/jartsr)
    - statically-linked x86_64-linux elf executables
Blink really isn't intended to run the binaries that come with your distro, because you're already able to run them. Blink is intended to let you transplant x86-Linux executables onto non x86-Linux systems.

For example, I like to build programs on my x86 Alpine Linux machine and then scp them onto my M1 MacBook, Raspberry Pi, FreeBSD, and Windows machines. Blink lets me run them once I do that. Copying program files only makes sense if the program files are static. Linux distro executables can't be distributed to somewhere other than the distros that created them, unless a tool like Docker is used which recreates the whole distro.

Blink is for people who want to be able to distribute Linux software without having to be in the Linux Distributor business too.


If the binary is linked against an old enough version of glibc, it should run on most non-exotic Linux distributions, shouldn't it?

At least I feel like I'm regularly installing things from people that are not in the Linux Distributor business and are not huge Go binaries either (though admittedly, those seem to be trendy as well :) )


Yes... You can technically spin up a RHEL5 VM and compile your code using GCC 4.2.1. That's what I think the Python community still does for manylinux1. The problem with that is GCC got so much better in the last fifteen years, that you might as well be programming with one hand. We shouldn't need to do that. There's nothing wrong with modern GCC 9+ which can produce backwards compatible binaries just fine. The problem is the policy decisions made by the Glibc developers. So if you can avoid Glibc and dynamic linking, then distributing to many distros and operating systems while using modern tools becomes easy.


Those binaries might also just be statically linked with MUSL (at least that's how I'm building my distro-agnostic Linux command line tools). Same idea as Go binaries though (except that I haven't noticed that such binaries are particularly big versus binaries that link against glibc - of course you need some 'meat' in the tool, not just a plain 'hello world', but the difference is measured in kilobytes, not megabytes).


Hi jart! You're an inspiration! That's all :)


The comparison with QEMU is with KVM disabled, right? Assuming this is true, how does it compare with KVM enabled?


I think this is a user mode emulator, so qemu with kvm isn't a great comparison.


Blink is primarily a user mode emulator, but it does support real mode BIOS programs. It can even bootstrap Cosmopolitan Libc bare metal programs into long mode. Here's a video of Blink doing just that. https://storage.googleapis.com/justine/sectorlisp2/sectorlis...


Is this true? Why can’t qemu use kvm for user mode emulation?


KVM requires additional privileges. A Linux container would need privileged rights and access to /dev/kvm to run QEMU with KVM for example, whereas any container should be able to run it in user-mode.


That's not really an issue, as there's a lot of infrastructure around optionally giving device file access to containers. That's why SECCOMP_IOCTL_NOTIF_ADDFD exists.


Nobody's really set it up to do that as it's easier to use Linux's sandboxing features if you're looking to run user code of the same cpu ISA. GVisor has an (experimental last time I checked) backend that uses KVM to run user mode code, but there you have the win of the sandboxing code being written in a memory safe language and giving you a real privilege boundary as opposed to the sieve that qemu-user is. In just about every other instance just running code natively in regular user space (even if sandboxed with seccomp or a ptrace jail) achieves the underlying goals better.
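
For concreteness, the simplest form of the native sandboxing referred to here is seccomp's strict mode, which confines a process to four syscalls (real sandboxes such as gVisor build seccomp-bpf filters instead, but the principle is the same):

    /* After this prctl, the process may only call read(), write(),
       _exit() and sigreturn(); any other syscall raises SIGKILL. */
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include <unistd.h>

    int main(void) {
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0);
      write(1, "inside the jail\n", 16);
      _exit(0);  /* calling open() here instead would kill the process */
    }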


It depends on whether you're more afraid of language bugs or hardware bugs. One potentially nice thing about having a tool like Blink that can fully virtualize the memory of existing programs, is it's sort of like an extreme version of ASLR. In order to virtualize a fixed address space, you have to break apart memory into pieces and shuffle them around into things like radix tries, and that might provide enough obfuscation of the actual memory to protect you from someone rowhammering your system. I don't know if it's true but it'd be fun to test.
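
A sketch of the kind of structure meant here, with illustrative parameters rather than Blink's actual layout: a 48-bit guest address walks a four-level radix trie whose leaf slots can point at host pages living anywhere, which is what provides the shuffling.

    /* Resolve a guest virtual address through a 4-level radix trie:
       four 9-bit indexes plus a 12-bit page offset = 48 bits. */
    #include <stddef.h>
    #include <stdint.h>

    #define BITS 9
    #define MASK ((1u << BITS) - 1)

    struct Node { void *slot[1 << BITS]; };  /* interior or leaf pointers */

    void *Resolve(struct Node *root, uint64_t vaddr) {
      struct Node *n = root;
      for (int shift = 12 + BITS * 3; ; shift -= BITS) {
        void *next = n->slot[(vaddr >> shift) & MASK];
        if (!next) return NULL;          /* unmapped guest page */
        if (shift == 12)                 /* leaf: host page base */
          return (char *)next + (vaddr & 0xfff);
        n = next;
      }
    }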


KVM allows you to run guests directly on the CPU and has native performance.


Well, not quite 'native'. TLB refills are 4x to 5x as expensive, and anything that needs a context switch tends to be at a minimum twice as expensive, and it's common to balloon even farther from there.


I guess that's mostly if you are running a full operating system inside it, generally in Qemu. It doesn't have to be - it could just be a program. Tiny programs running in KVM can use big pages and never cause or require any pagetable changes.

For simple workloads it can even be faster than native, unless you dynamically load something that uses bigger pages for your native program, e.g. https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-Fo...


It's harder to force huge pages on a guest than it is to just use them in regular user space where you can simply mmap them in.

And none of that accounts for the increased context switch time.


The guest is not in control. Sure, there are a few pages at the beginning of each section that have to be 4K until you reach the first 2MB multiple.

What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".

The point is: KVM is native speed if you never have to leave. I don't need to prove this for anyone to understand it has to be true.


> The guest is not in control

The guest has its own page tables above the nested guest-phys->host-phys tables.

> What context switch time? It takes 5 micros to enter and leave the guest. The rest is just "workload".

And then the kernel doesn't know what to do with nearly every guest exit on KVM, so then you trap out to host user space, which then probably can't do much without the host kernel so you transition back to kernel space to actually perform whatever IO is needed, then back to host user, then back to host kernel to restart the guest, then back from host kernel to guest. So six total context swaps on a good day guest->host_kern->host_user->host_kern->host_user->host_kern->guest.
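
An abbreviated KVM vcpu loop makes the ping-ponging visible (error handling omitted; the fds come from the usual KVM_CREATE_VM and KVM_CREATE_VCPU ioctls):

    /* Every exit the kernel can't handle lands back in this host
       user-space loop, which emulates the event and re-enters. */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    void RunVcpu(int kvm_fd, int vcpu_fd) {
      int sz = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
      struct kvm_run *run = mmap(0, sz, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, vcpu_fd, 0);
      for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);    /* host user -> kernel -> guest */
        switch (run->exit_reason) {    /* guest -> kernel -> back here */
          case KVM_EXIT_IO:            /* emulate the port I/O... */
          case KVM_EXIT_MMIO:          /* ...or the memory access */
            break;                     /* then loop back into the guest */
          case KVM_EXIT_HLT:
            return;
        }
      }
    }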


Right, that's very true! It's clear that you know what you're talking about when it comes to KVM and maybe even the internal structure in Linux. However, I/O can be avoided. Imagine a guest that needs no I/O, doesn't have any interrupts enabled, and simply runs a workload straight on the CPU (given that it has all the bits it needs). That is what I have made for $COMPANY, which is in production, and serves a ... purpose. I can't really elaborate more than I already have. But you get the gist of it. It works great. It does the job, and it sandboxes a piece of code at native speed. Lots of ifs and buts and memory sharing and tricks to get it to be fast and low latency. No need for JIT, which is a security and complexity nightmare.

The topic of this thread is about Blink, which happens to be a userspace emulator. Hence my comment.


I usually measure the functions I write in picoseconds per byte, so 5 microseconds is an eternity.


10 ps/byte is equivalent to 100 GB/sec; unless you routinely write functions that are in the tens of GB/sec range, you probably mean nanoseconds?


I work on a C library. Some of the functions I've written, like memmove(), take about 7 picoseconds per byte for sizes that are within the L1 cache, thanks to enhanced rep movsb.
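
For the curious, the forward-copy half of such a routine is tiny. This is a sketch of the idea rather than the library's actual code; a real memmove() also needs a backward path for overlapping regions:

    /* On CPUs with enhanced rep movsb (ERMS), microcode moves whole
       cache lines at a time, hence the ~7 ps/byte figure above. */
    #include <stddef.h>

    void *CopyForward(void *dst, const void *src, size_t n) {
      void *ret = dst;
      asm volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(n)
                   : /* no other inputs */
                   : "memory");
      return ret;
    }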


That's a very special case though since it's hardware optimized to work up to a cache line at a time, and not at all related to the syscall cost that was mentioned in the parent comment.


The 5us was the setup time in order to be able to enter the sandbox. A system call is around 1us, but rarely used. So, in general the overhead of using the sandbox is around 5us, as everything else is pure workload.


> Blink is at least 2 times faster than QEMU

> Blink is now outperforming Qemu by 13% when emulating GCC.

Nice work. But isn't QEMU notoriously slow?


It's faster than you think. :)


Where is the getting started guide for this?



Incredible work, jart! Have you tested if it can run on Asahi Linux (Linux for ARM Macs)?


That reverse debugging feature looks cool. At a high level, how does it work?


Very simply. Took three hours to code. Blink just takes a "screenshot" of the ANSI terminal display, gzips it, and appends it to a ring buffer with 64,000 entries. Each screenshot is a full frame. There's no delta encoding or anything smart like that. 64k gzipped frames takes a few hundred megs of memory once it fills up, depending on how big your terminal display is. My favorite part is that, when you press the 'c' button to continue execution, Blink will only transmit 60 FPS to your terminal. In-between frames, it's still capturing screenshots in the background, which get added to the ring buffer. It'll save thousands of frames per second, so you can skim through execution pretty quickly, like a tv show, and then maybe you'll see something happen, smash ctrl-c, and scroll backwards slowly with fine-grained per-instruction granularity, thus illuminating the event you saw fly by earlier. It's so nice. Especially if you're used to tools like GDB TUI, which permits scroll-wheeling down the assembly display, but it won't let you scroll-wheel back up to the thing you were just looking at!
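
The scheme as described has only a few moving parts. A sketch with zlib's compress() (Blink's actual code will differ in details):

    /* Compress each full terminal frame and push it into a fixed
       ring buffer of 64,000 entries, overwriting the oldest. */
    #include <stdlib.h>
    #include <zlib.h>

    #define RING 64000

    static struct Frame { unsigned char *data; uLongf size; } ring[RING];
    static size_t head;  /* next slot to overwrite */

    void Snapshot(const unsigned char *screen, uLong n) {
      uLongf cap = compressBound(n);
      unsigned char *buf = malloc(cap);
      if (!buf || compress(buf, &cap, screen, n) != Z_OK) {
        free(buf);
        return;
      }
      free(ring[head].data);  /* drop the oldest frame */
      ring[head].data = buf;
      ring[head].size = cap;
      head = (head + 1) % RING;
    }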


Oops, maybe not that high-level. I meant, underneath the UI layer, how is the state/execution path of the emulated CPU efficiently recorded during execution and later rewound? Any approaches I can think of would destroy the performance. I have a toy emulator project that dumps the entirety of RAM every frame and I'm trying to come up with ideas to improve its efficiency.


Blink's reverse debugging is implemented purely at the user-interface level. It doesn't have anything to do with the execution engine. While Blink does have near-perfect knowledge of memory, and I considered doing the thing you proposed, where deltas are stored and reconstructed, I decided that it wouldn't offer compelling benefits beyond just capturing and rewinding screenshots. For Blink's use case, screenshots are arguably superior, since they perfectly preserve the UI states previously observed by the user. I understand this might not be an option for your project, since you might not have a display that's visualizing program memory and execution. In that case, I'd suggest: why not build that first?


That's a really slick approach. I reread your first post again and you explained it perfectly, I just didn't grok it the first time around for some reason. Maybe I thought it couldn't be that simple... but it is! Thanks for patiently explaining twice.


Why bother making this? (even if it is really cool)



We do what we must because we can.


Where is it? I do not have a twitter account.


Self-hosting is certainly a thing if the emulator can run on the emulated platform. Same for VMMs that can run on a virtualized platform.

It's a nice test also and can be useful for debugging.


This guy got Doom running inside of Doom using a code execution exploit

https://www.youtube.com/watch?v=c6hnQ1RKhbo


Is it actually 2x faster or 2x faster at starting up? QEMU does so much stuff, running cc1 on hello world isn't really a stress of the interpreter IMO as much as all the crap that goes around it.


Blink actually does run the GCC 9 cc1 command from start to finish twice as fast: Qemu takes 600ms to run it and Blink takes 300ms. Both Qemu and Blink use a JIT approach. Since GCC's cc1 is a 33MB binary, a lot of the time it takes to run stresses the JIT pretty hard. https://twitter.com/JustineTunney/status/1610276286269722629


That's partly what I meant though: how fast is it on a longer-running process? C doesn't require all that much semantic analysis, so there usually isn't all that much hot code in the compiler, so it would suit a simple, fast JIT, whereas QEMU does do some basic optimizations.

I've only ever really skimmed the TCG source code, but it wouldn't surprise me if a newer JIT could smack its arse, given that with these old C codebases (it's probably one of Bellard's few flaws) it's pretty hard to actually make true architectural changes.

The Java/JavaScript (I think more JavaScript, but I'm hedging my bets by including JVMs too) JITs are probably the cutting edge, but I'd imagine still quite beatable for a few cases.


Rosetta is closer to QEMU than to v8 or Hotspot. Granted, it benefits from an insanely large out of order execution engine, but it shows that there's only so much you need to optimize when translating assembly code.


"blinkenlights" put a smile on my face.

Looks like Blink itself can't yet be compiled with Cosmopolitan Libc (though it emulates programs compiled with it), but that's planned - very cool!


Author here. I'm planning to get Blink to compile with Cosmopolitan Libc as soon as possible. There's just a few switch statements that need to be refactored. There's a really nice `cosmocc` toolchain that makes building POSIX software with Cosmo easier than ever. See https://github.com/jart/cosmopolitan/blob/master/tool/script... and https://github.com/jart/cosmopolitan/blob/master/tool/script...


Will compiling with Cosmopolitan enable it to run on Windows?


Absolutely. If you download last year's release of Blinkenlights, you can actually use this software on Windows today. It works great in the Windows 10 command prompt or powershell. https://justine.lol/blinkenlights/download.html


Awesome. I actually started trying to get it to build against mingw-w64 earlier today, but I guess I'll just wait for you. :)

I'm not a windows user, but super interested in using Blink to ship pre-compiled binaries as part of various Bazel rule sets.


I've wanted that for the longest time. You know the feeling. You mostly write something like Java and Bazel works fantastically at that. But then you find out you need about 100 lines of C code in your build to do some specific thing Java can't do, and suddenly you've opened Pandora's box, because compiling native code across platforms is a nightmare. Life would be so much easier if we could just vendor a 40kb prebuilt Linux binary that could run on all the other platforms too. All you need now to do that is a few additional ~200kb blink binaries.


> Me: How small can an emulator be? Blink: Yes.

My favorite FAQ


Does it require cosmopolitan libc? I compiled a hello world example in C with `zig cc --target=x86_64-linux-musl` but when running it with Blink, nothing happened.

With memory safety turned on, could Blink be used as an alternative to WebAssembly?


At a glance, the debugger user interface looks much nicer than gdb's terminal ui. How tightly coupled is the debugger interface to the emulator/debugger engine? How much work would it be to plug in a different debugger, say lldb or gdb, into the ui instead of blink?

I think the user experience of cli debuggers is generally somewhat dreadful when compared to their gui cousins -- they seem to display a much narrower view of what's going on. Could the big blinkenlights debugger view be useful outside of blink itself?


You can go the other way around and use other TUIs for GDB:

* https://github.com/pwndbg/pwndbg
* https://github.com/longld/peda
* https://github.com/hugsy/gef


Ideally Blink would just support the GDB remote serial protocol (RSP), so you could directly use GDB or LLDB on the emulator itself.
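
The RSP wire format is simple enough that this is a realistic ask. A minimal sketch of the framing (not tied to Blink): a packet is "$<data>#<ck>" where <ck> is two hex digits of the mod-256 sum of <data>, and the peer acknowledges with '+'.

    /* Send one RSP packet; e.g. SendPacket(conn, "S05") reports a
       SIGTRAP stop to the attached debugger. */
    #include <stdio.h>

    void SendPacket(FILE *conn, const char *data) {
      unsigned ck = 0;
      for (const char *p = data; *p; ++p) ck = (ck + (unsigned char)*p) & 0xff;
      fprintf(conn, "$%s#%02x", data, ck);
      fflush(conn);
    }

A stub that answers register and memory queries in this framing can then be driven by stock gdb via "target remote".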


Why would you open a blog post about X not with a link to it, but a link to the author's twitter page? That has no use to many of us, who can't even scroll down through the tweets looking for a link should we feel... spelunky....


So this is 600 versus 300 microseconds, and the difference could be in the startup time. Wouldn't it make much more sense to test a longer running program?


You must be from a country like Germany, my friend, because the comma is part of the number. It actually reads ~330000 microseconds. I really should switch to using an apostrophe instead.


Blink is a new CPU emulator written in C, made by Justine Tunney. Besides having a really cool name, blink has a lot of impressive features and some of them will blow your mind!


Is it not too late to RiiR?


Video conferencing is mentioned.

Anyone know of any other use cases they have in mind?

I always hear tech spec rumors but never about anything I would want to do with this type of thing… outside say gaming?


I think you're looking for the Apple headset thread: https://news.ycombinator.com/item?id=34250929


TY


Congrats to the author. Though I wonder: what's the catch when it comes to perf? I can't imagine QEMU devs leaving much performance on the table either.


Blink is currently weak in a few areas, like SSE floating point instructions, which aren't JIT'd very well. Blink is also weak at x87 floating point emulation, which gets truncated to double precision. I anticipate issues like these will improve quite rapidly. It was only two months ago, before I started optimizing this project, that Blink ran at the same speed as Bochs (i.e. 100x slower). Today, Blink runs slow-start ephemeral programs such as GCC 2x faster than Qemu, and it now has comparable speed on some mathematical benchmarks, such as elliptic curve crypto.


Can it emulate CPython? So, can you do reverse debugging of Python? I think this would be interesting.


Blink can absolutely emulate CPython. https://twitter.com/JustineTunney/status/1611264272637775873

I recommend obtaining Python from the Cosmopolitan mono repo:

    git clone https://github.com/jart/cosmopolitan
    make -j8 o//third_party/python/python.com
That'll easily give you a static Python 3.6 binary that can be run under Blink.


This is really neat, but not to be confused with Blink, the name of the browser engine underlying Google Chrome, Chromium, and derivative browsers.


The problem is compounded further considering the most popular terminal emulator on iOS is also called Blink.



Pretty sure there are several SSH clients that are several times more popular than Blink :)


There is also an iOS/iPadOS SSH Client called Blink [1], short for Blink Shell, which I use almost daily.

[1]: https://blink.sh/


I was thinking they compiled Blink to Javascript and are rendering web pages with it.


We just managed to compile Blink to a 300kb javascript file today. Follow https://github.com/jart/blink/issues/8 for updates on our progress.


But I was thinking of the HTML renderer...



