Hacker News new | past | comments | ask | show | jobs | submit login
Understanding every byte in a WASM module (danielmangum.com)
206 points by hasheddan 9 months ago | hide | past | favorite | 37 comments



If you are interested in this type of approach you may like https://wasmgroundup.com/

> This book takes a hands-on, bottoms-up approach: you'll go from hand crafting bytecodes to writing a real compiler for a simple programming language.

disclaimer: I'm the co-author


I'm impressed with the pricing of $19 early access / $39 final version.

I'm used to high-value books for professionals either being completely free,[1] or starting at $100. [2][3] It seems like a good compromise at that price.

[1] https://craftinginterpreters.com/contents.html

[2] https://www.refactoringui.com/#get-refactoring-ui

[3] https://www.amazon.com/dp/0262048434


This is pretty great! I highly recommend building a parser for Wasm, it shows off the nice design that it can be compiled in a streaming fashion and other nice properties.

I built a small parser for a wasm runtime I was going to hack at, although that lost steam and now I just contribute to wasmtime, but it was a great way to learn. The source if you’re interested in C++20 and coroutines: https://github.com/rockwotj/wasmcc


> the nice design

But integers are read 7 bits at a time. Wouldn't it be much easier to use 8 bits?


If you use all 8 bits of a byte for data, you need a separate way to store the length, or you need to make everything fixed-width which wastes space.

"Varints" or variable-length integers, with 7 bits of data and 1 continuation bit per byte, are used in a fairly wide variety of formats (e.g. Protobuf, SQLite and Lucene). They're compact when most of your integers are small, while still being quite simple and fast to encode/decode.

Faster alternatives tend to be more complicated, and more compact alternatives tend to be slower.

See https://en.wikipedia.org/wiki/Variable-length_quantity


I wonder if something resembling this could have been better than how IPv6 attempts to solve the address space problem.


Variable-length IP addresses were considered at the time: https://datatracker.ietf.org/doc/html/rfc1752#section-10.2


This is basically the design of UTF-8 or varint encoding. The integer's representation is read one byte (8 bits) at a time, which can then be interpreted by looking at the first bit which says whether the integer is "done" or more bytes need to be read.

(If we were using all 8 bits for the representation and an additional leading bit for whether it continues, we'd end up having to read 9 bits or something, which is awkward and not aligned with what the machine actually supports reading.)


UTF-8 is a bit more complex than that.

The first byte in a sequence basically tells you how many bytes are following by looking at the top-most bits.

    0xxxxxxx => no followup bytes, just the 7-bit ASCII code
    110xxxxx => 1 byte follows
    1110xxxx => 2 bytes follow
    11110xxx => 3 bytes follow
The followup bytes then always have the pattern 10xxxxxx.

E.g. the only bytes that have bit 7 cleared are the single-byte 7-bit ASCII character codes.


A UTF-8-like encoding would probably be better than VLQ encoding because the length is known up-front, which allows for the decoder to have fewer branches.

Something like [1] or alternatively [2]. See also [3].

[1] https://en.wikipedia.org/wiki/UTF-8#FSS-UTF [2] https://sqlite.org/src4/doc/trunk/www/varint.wiki [3] https://news.ycombinator.com/item?id=11263378


In this case, a 64-bit little-endian unsigned integer Ought to be enough for anyone. [1]

Using variable-length "micro-compression" schemes like these merges concerns [2], add processing overhead, and lose out on the simplicity and efficiency of working on the data directly in memory, maybe even reusing the same memory/cache line after checking identifiers, etc.

The 7-bit VLQ scheme [3] makes sense for MIDI sequences (1980s), as they were stored on MIDI hardware with limited memory, and had basic processing needs.

[1] https://quoteinvestigator.com/2011/09/08/640k-enough/

[2] https://en.wikipedia.org/wiki/Separation_of_concerns

[3] https://en.wikipedia.org/wiki/Variable-length_quantity


Eh, I don't think the .wasm file is intended to be an in memory format. For an on wire/storage format, variable length integers like protobuf uses is a perfectly valid choice.


It's actually the protobuf design mistake that most adversely affects decoding performance. The original creator who went on to make Cap'N Proto makes the excellent point that it's better to leave the extra bytes and then just slap a compressor that only works on runs of 0 bytes on top. Faster for the same savings. There are also newer schemes from Daniel Lemire and others more amenable to SIMD decoding.


Sure, but that decoding speed penalty doesn't stop protobuf (and similar encodings using leb128 like thrift) from being some of the most used binary encodings on the planet, even in relatively low latency spaces.

Additionally, Cap'N Proto RLE encodes the bytestream, which is pretty similar to the overhead of LEB128 integers.

And on top of all of that, .wasm isn't intended to be an IPC format, so the fact that it's choices are typical of that space (despite not being leading edge) reflect very well for the space it's intended to be in.


It would be. The use of varints in wasm complicates encoding and decoding and introduces ambiguity into the format for cases like v128 constants (which despite technically being 128-bit integers, are not embedded using leb128.) It also makes it harder to write a single-pass compiler.

It also isn't much of a wire size improvement, since the overall format design assumes you will compress it using brotli (or gzip, I guess) - there are lots of opportunities to make the format smaller or arrange it in a way that makes it more compressible, and the decision was made not to do those things since brotli is really good at compressing data.

I think the people who took control of the binary format just really liked varints for some reason, so we ended up stuck with them. It's at least not particularly harmful, it's just a small waste of everyone's time. Thankfully, my understanding is that in practice most modules will only get decoded once or twice before the compiled code is cached by your browser.


It is a balance between simplicity, speed and size. They choose size, honesty leb128 isn’t *too* complex (you can see an implementation in my repo I pointed too). But yes they don’t take all the simplicity tradeoffs.


Protobuf seems to use a very similar encoding.


If you're looking into a .wasm file I wrote an interactive viewer called weave[1].

Here is a demo with some Rust output: https://evmar.github.io/weave/?wasm/rust.wasm be sure to click on 'code' to dive into individual functions.

[1] https://github.com/evmar/weave


There’s also https://wasdk.github.io/wasmcodeexplorer/ which highlights the bytes.


I really appreciate this article. I’ve noticed that modern streaming web players have started using WASM as part of their obfuscation toolbox. Understanding WASM modules is important for reverse engineering.


The browser devtools can give you a disassembled view of the WebAssembly byte code (basically the wasm2wat output) and also allow to set breakpoints and debug-step through the disassembly.

Since WASM is a "structured-programming ISA" with code blocks that can be intended, it's actually quite convenient.


Thanks for the pointer!

The WASM payload itself appears to be obfuscated. The player seems to deobfuscate it at runtime, dynamically load it as a WASM module, execute it and then destroy it again. I'm having a hard time pinpointing the exact moment where the WASM is being loaded (because the JS that performs all those things is horribly obfuscated, too.)

The disassembled view in the dev tools might come in handy to tackle this. My goal is to obtain their secret HMAC key so I can change "Linux" into whatever the server prefers to hear, and re-sign the HMAC for that manipulated HTTP request before the player sends it.

(All that effort so I finally get to watch my Bundesliga streams again, for which I'm paying them 30 EUR a month.)


Ctrl+f for the WASM compile function (or maybe replace it with a wrapper function that has a debugger statement), breakpoint there, step over, then look in the Sources panel. You'll find the WAT for your secret wasm then.

It may even still be in Sources panel post-deletion, in which case no need for any of that.


Sky?


NowTV (aka WowTV in German-speaking countries). Part of the code mentions Sky.


This seems sort of like understanding machine code vs assembly; it's much easier to learn WAT and translate to/from WASM as necessary using the wabt tools [0].

Either way its super cool how simple WebAssembly is, you can really get your hands dirty and understand exactly every detail of how your program runs!

[0] https://github.com/WebAssembly/wabt


A while back I ran through some exercises that had me writing WAT directly. It's a great way to understand what WASM is doing, and is surprisingly easy to pick up (though of course you wouldn't want to write an actual program with it). It's just system calls with s-expressions.


WAT is a fun format! I'm curating some readable samples here -- https://github.com/eliben/wasm-wat-samples/


I already had that bookmarked from using it several months ago to try and figure things out. It's been very helpful. Thank you for creating it!


I have a hand-written BF interpreter in WAT, along with hand-written asm.js for comparison, over here https://github.com/magcius/bfasm


Bookmarking that for later, thanks for sharing!


For some time I've been fascinated by the codebase of a small WAT compiler written in JavaScript.

https://github.com/stagas/wat-compiler/blob/main/story.txt

And I mean "small" as a real complement to how readable and compact the entire compiler is. It's been a great way to appreciate the design of the WASM text format and WASM overall. It's not a Lisp but has a similar feel to it.

I've been meaning to get more fluent at writing WAT directly, not for any practical purpose but just for pleasure of it. I could see myself gradually building up some abstractions to help me deveolp larger programs, perhaps a slightly higher-level language.


Very cool. You might be interested in AssemblyScript, if you haven't heard of them before. They did something very similar: https://www.assemblyscript.org/


Ah yes, I'd love to get more familiar with AssemblyScript too. One thing I like about the afore-mentioned WAT compiler though, is that it's simple to run in the browser, which the AssemblyScript compiler doesn't support, if I understand correctly. That characteristic of the compiler would be useful if I want to build a higher-level language on top of it, for example with a web playground (code editor and runtime).


The AssemblyScript compiler runs in a browser too, that’s how the demo on the homepage works


Thanks for mentioning it, I dug into the website of AssemblyScript and found that the compiler does indeed run in the browser. The demo loads a web-build like:

  <script src="https://cdn.jsdelivr.net/npm/assemblyscript@latest/dist/web.js"></script>
  <script type="module">
  
  import asc from "assemblyscript/asc";
And how they compile AssemblyScript to WAT/WASM/JS:

https://github.com/AssemblyScript/website/blob/main/src/.vue...

Here's how the compiler is built to target the web.

https://github.com/AssemblyScript/assemblyscript/blob/main/s...

How fun! This inspires me to explore further what I could build with it.


I entered into WASM assuming it was simple and minimal but there are a lot of runtime edge cases that are quite surprising. I built an interpreter that passes all the non-SIMD test cases from the official spec. Would be interesting to clean-up the spec to be a real minimal execution only core. There is a real barrier to entry for producing new lightweight implementations.

https://github.com/peterseymour/winter




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: