It's useful whenever you don't know the value of an integer but would like to allocate space for it now, and then fill in the value later. Many have mentioned length-prefixed data, which is a good example. Another use case is static linking. I believe LLVM uses this when generating WASM object files.
I think this is probably the real reason such encodings are considered valid. The webassembly spec is explicit about allowing valid over-wide encodings:
Non-canonical encodings are actually quite useful for some applications that need variable length integers. DWARF and WASM both use LEB128.
The problem is linking: a compiler needs to emit code into independent translation units, which contain "missing" references to symbols in other translation units, without yet knowing where all the code will end up in the final executable. Since we don't know where the location of other code is yet, we don't know how big the number representing that location is yet, which means that we don't know how wide the variable length encoding of that number will be. If the width changes after linking, then we have to push around the surrounding code to make space for the wider integer. Unfortunately, this changes the location of all the surrounding code, so we have to recompute all the references!
The solution is to always emit un-linked var ints in the widest possible encoding (5 bytes for LEB128) that way when the references are patched during linking, no code is moved around. All integers can be converted to a non-canonical 5 byte form that is "wasteful" but its a worthwhile tradeoff because it solves this issue. Other integers that don't need to be linked can be packed in a smaller var int form to save space.
Neat! It's a useful technique whenever you don't know or want to defer knowing the size of an integer until a later time, but need to allocate space for it up front.
I'm wary of introducing these forced-canonical encodings by someone hyper focused on "efficiency" and "security" without reconsidering additional use cases.
What good does it really do me if they "stand behind their work"? Does that save me any time drudging through the code? No, it just gives me a script for reprimanding. I don't want to reprimand. I want to review code that was given to me in good faith.
At work once I had to review some code that, in the same file, declared a "FooBar" struct and a "BarFoo" struct, both with identical field names/types, and complete with boilerplate to convert between them. This split served no purpose whatsoever, it was probably just the result of telling an agent to iterate until the code compiled then shipping it off without actually reading what it had done. Yelling at them that they should "stand behind their work" doesn't give me back the time I lost trying to figure out why on earth the code was written this way. It just makes me into an asshole.
It adds accountability, which is unfortunately something that ends up lacking in practice.
If you write bad code that creates a bug, I expect you to own it when possible. If you can't and the root cause is bad code, then we probably need to have a chat about that.
Of course the goal isn't to be a jerk. Lots of normal bugs make it through in reality. But if the root cause is true negligence, then there's a problem there.
If you asked Claude to review the code it would probably have pointed out the duplication pretty quickly. And I think this is the thing - if we are going to manage programmers who are using LLM's to write code, and have to do reviews for their code, reviewers aren't going to be able to do it for much longer without resorting to LLM assistance themselves to get the job done.
It's not going to be enough to say - "I don't use LLM's".
Yelling at incompetent or lazy co-workers isn't your responsibility, it's your manager's. Escalate the issue and let them be the asshole. And if they don't handle it, well it's time to look for a new job.
Nope. I was thinking about implementing call/cc; there is a neat trick involving first doing a CPS transformation to the source, then providing call/cc as a builtin function that more-or-less just grabs the continuation argument and returns it. This would slot pretty easily in between expansion and code gen and the code generator would remain mostly untouched.
reply