Show HN: Prompts as WASM Programs (github.com/microsoft)
209 points by mmoskal on March 11, 2024 | 35 comments
AICI is a proposed common interface between LLM inference engines (llama.cpp, vLLM, HF Transformers, etc.) and "controllers" - programs that can constrain the LLM output according to regexp, grammar, or custom logic, as well as control the generation process (forking, backtracking, etc.).

AICI is based on Wasm, and is designed to be fast (runs on CPU while GPU is busy), secure (can run in multi-tenant cloud deployments), and flexible (allow libraries like Guidance, LMQL, Outlines, etc. to work on top of it).

We (Microsoft Research) released it recently and would love feedback on the design of the interface, as well as on our Rust AICI runtime.

I'm the lead developer on this project and happy to answer any questions!




This all clicked for me when I got to this example in the README:

    async def main():
        # This is the prompt we want to run.
        # Note how the prompt doesn't mention a number of vehicles or how to format the result.
        prompt = "What are the most popular types of vehicles?\n"

        # Tell the model to generate the prompt string, i.e., let's start with the prompt "to complete"
        await aici.FixedTokens(prompt)

        # Store the current position in the token generation process
        marker = aici.Label()

        for i in range(1, 6):
            # Tell the model to generate the list number
            await aici.FixedTokens(f"{i}.")

            # Wait for the model to generate a vehicle name and end with a new line
            await aici.gen_text(stop_at="\n")

        await aici.FixedTokens("\n")

        # Store the tokens generated in a result variable
        aici.set_var("result", marker.text_since())

    aici.start(main())
This is a similar pattern to llama.cpp grammars - it's a way to directly manipulate the token generation phase, such that you can occasionally take over from the model and say "here we are outputting '1.' - then back to you for the bullet item".

The most obvious usage of this is forcing a model to output valid JSON - including JSON that exactly matches a given schema (like OpenAI Functions if they were 100% reliable).
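
For example (just a sketch, not confirmed API - the regex/stop_at arguments and their exact stopping semantics are guesses based on the snippets in this thread), the same primitives could drive a schema-shaped output directly:

    import aici

    # Sketch only: assumes gen_text() stops *before* the stop_at text, so the
    # fixed tokens below supply the quotes and punctuation themselves.
    async def vehicle_json():
        await aici.FixedTokens('Describe one popular vehicle as JSON.\n{\n  "name": "')
        await aici.gen_text(regex=r'[A-Za-z0-9 ]+', stop_at='"')   # model fills the value
        await aici.FixedTokens('",\n  "wheels": ')
        await aici.gen_text(regex=r'[0-9]{1,2}', stop_at='\n')     # a small integer
        await aici.FixedTokens('\n}\n')

    aici.start(vehicle_json())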

That Python code is a really elegant piece of API design.


> The most obvious usage of this is forcing a model to output valid JSON

Isn't this something that Outlines [0], Guidance [1] and others [2] already solved much more elegantly?

0. https://github.com/outlines-dev/outlines

1. https://github.com/guidance-ai/guidance

2. https://github.com/sgl-project/sglang


Each of those is great, and more sophisticated strategies are being developed all the time: incorporating static analysis at inference time to ensure code is more than just syntactically correct [0]; multi-LLM-call agent-based frameworks [1]; decomposing hard tasks into many "easier" LLM tasks [2]; analyzing internal activations to identify potential hallucinations [3]; activation steering to control generations [4]; and "Language Model Arithmetic" to compose biases for style, vocab, etc. [5].

But each also either requires tight integration with an LLM inference/serving engine (to access its low-level internals), or adds a lot of overhead (many individual LLM calls) when it doesn't have that integration.

The AI Controller Interface is creating an abstraction layer that exposes the low-level primitives so that all the above strategies can be implemented without each one needing to dive into the internals of every LLM engine. AICI doesn't support all the necessary primitives for all of these yet (e.g., we don't know the right way to represent internal activations), and not everything will end up fitting in a WASM module anyway.

It's a start at thinking about a new layer in the inference stack.

[0] https://arxiv.org/abs/2306.10763 [1] https://github.com/microsoft/autogen [2] https://arxiv.org/abs/2305.14292 [3] https://arxiv.org/abs/2312.02073 [4] https://arxiv.org/abs/2308.10248 [5] https://arxiv.org/abs/2311.14479


Kind of. The usual mechanism for those sorts of libraries is a grammar-informed-token-filter. You get an output adhering to the grammar (or an error), but the mechanism is reshaping some intermediate conditional probabilities.

That's problematic because the entire point of an LLM (in most scenarios with non-trivial output sizes) is sampling from a calibrated probability distribution. The conditional mask in generation allows you to factor that into per-token work (given the preceding text, what's the probability of the next token being token X?), but none of that holds even approximately when you start tweaking the sampling parameters (the same reason why playing with "temperature" gives very bad results for a lot of complicated problems -- the greedy solution we're executing isn't globally correct).

For a small, illustrative example that I've seen, suppose you're trying to parse a freeform text input into a tagged text input (podcast transcripts or something), squishing the result into a json schema or something so that you can actually use it in your end application. For longer text fields, many SOTA LLMs are prone to close the field early and append an ellipsis rather than flesh it out in its entirety. Constrained schema generation will have a high probability of closing the (broken) string and a low probability of continuing past the (incorrect) ellipsis. Contrast that with an approach of retrying till the result doesn't crash a parser: the model usually will not emit a closing quote after an ellipsis, so the output fails to parse and crashes all those times it would have been invalid. The net result with constrained generation is that you still have an invalid result, but nothing crashed, so now you don't know that fact.

This particular library also suffers from that same flaw (though the flaw is important and common, so I think it's worth repeating myself about it once in a while), but it has two very nice features I'd like to touch on (contrasted with spec/grammar approaches), one of which partially mitigates that flaw:

1. You can interact with partial results via a Turing-complete language. It's really messy to use weaker languages to encode "this one json field is probably broken if it ends in an ellipsis, but other strings probably can and should end that way", and

2. You have your results available directly in your language of choice without an additional parsing/extraction step. Resource-wise I don't know that it matters, but tons of simple problems are much more elegantly described as "I want these things" instead of "I want this composite object -> post-process that composite object".

That said, using it for json in particular looks terrible, like you said (but you could presumably wrap one of the other json libraries with this and reap some of the benefits of both approaches).


> That's problematic because the entire point of an LLM

Could you detail what the results of that problematic behavior are? Do you suspect structured generation performs worse than unstructured?

> squishing the result into a json schema

The strategies I've seen for dealing with structured generation seem to be ensuring that the structure is represented in the prompt, so the model isn't squishing anything.

> For longer text fields, many SOTA LLMs are prone to close the field early and append an ellipsis rather than flesh it out in its entirety. Constrained schema generation will have a high probability of closing the (broken) string and a low probability of continuing past the (incorrect) ellipsis,

Part of structured generation is specifying exactly the range you want for the field, so I'm not sure how this issue arises. With structured generation you force the LLM to generate at least some number of characters, so this seems like an argument in favor of structured generation rather than against. If unstructured LLMs want to "close the field", then structured generation can force it to remain open.
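
For instance, with the pyctrl-style API quoted upthread, a minimum length can be written as a bounded regex (illustrative only; the exact argument semantics are assumed):

    import aici

    async def long_field():
        await aici.FixedTokens('Summary: "')
        # Force between 80 and 400 characters before the closing quote;
        # the regex/stop_at behavior here is assumed, not confirmed API.
        await aici.gen_text(regex=r'[^"]{80,400}', stop_at='"')

    aici.start(long_field())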


> Could you detail what the results of that problematic behavior are?

I'll contrast two sampling approaches (ignoring prompting and whatnot since that's orthogonal and can be applied to either): (1) repeat till the answer adheres to a grammar, and (2) filter the set of possible next tokens to those which could adhere to that grammar.
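
A minimal sketch of approach (1), with a hypothetical generate() standing in for whatever unconstrained sampler you're using, and JSON-parseability standing in for "adheres to a grammar":

    import json

    def sample_until_valid(generate, prompt, max_tries=10):
        # Approach (1): resample unconstrained completions until one parses.
        # Relative probabilities among the valid outputs stay exactly what the
        # base model assigns them; invalid outputs are simply discarded.
        for _ in range(max_tries):
            text = generate(prompt)        # hypothetical unconstrained sampler
            try:
                return json.loads(text)    # the "adheres to the grammar" check
            except json.JSONDecodeError:
                pass
        raise ValueError("no valid completion within max_tries")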

The former preserves the same relative probabilities in valid answers that the base model would have, whereas the latter has an unknown, not intuitively explainable, and wildly differing distribution.
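
To make that concrete, here's a toy calculation with made-up numbers for a two-token "model" whose grammar only admits the sequences "A C" and "B D":

    # Made-up base-model probabilities for a two-token completion.
    p_first = {"A": 0.9, "B": 0.1}
    p_second = {"A": {"C": 0.01, "D": 0.99},  # after A, the grammar-valid C is unlikely
                "B": {"C": 0.5, "D": 0.5}}

    # Approach (1): rejection sampling conditions on validity, keeping the
    # base model's relative weights among the valid sequences.
    p_valid = {"A C": p_first["A"] * p_second["A"]["C"],
               "B D": p_first["B"] * p_second["B"]["D"]}
    z = sum(p_valid.values())
    print({k: round(v / z, 3) for k, v in p_valid.items()})  # {'A C': 0.153, 'B D': 0.847}

    # Approach (2): per-token filtering masks nothing at step one (both A and B
    # can still lead to a valid sequence), then forces the only legal continuation.
    print({"A C": p_first["A"], "B D": p_first["B"]})        # {'A C': 0.9, 'B D': 0.1}

Both approaches only ever emit valid sequences, but they disagree wildly about how often each one shows up.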

Anecdotally, one way in which that manifests is that when the model inevitably makes a grammar-admissible mistake on some token, the base model is then more likely to also make a mistake in the grammar than it otherwise would have. Sampling strategy (1) throws that mistake away, whereas sampling strategy (2) silently forces the rest of the answer to comply, yielding an incorrect but "valid" answer. The long-string-ellipsis problem I described falls into that class of problems.

The specific behaviors are hard to qualitatively describe in full generality though because of the huge number of ways in which a "wrong" probability distribution can be wrong.

> Do you suspect structured generation performs worse than unstructured?

Yes, often, not always. E.g., if your schema is enum{Red,Blue,Green}, you can prove that the two distributions are in fact equal, so structured generation would be strictly better because it's cheaper. For specialized problems, structured generation is another meta-parameter, and despite having no intuitive explanation for _why_ it performs better, if it accidentally performs better for your particular problem then that's a huge win. Even if it increases the error rate, it's also cheaper than the alternatives (and most reasonable applications of an LLM assume a nontrivial error rate anyway, so that's not necessarily a huge cost), so it might be "better" for an application despite lower-quality results.

For complicated schemas though, I'd definitely at least want to measure the difference. Anecdotally, structured generation as a sampling procedure performs worse for me on complicated problems than unstructured generation.

> The strategies I've seen for dealing with structured generation seem to be ensuring that the structure is represented in the prompt, so the model isn't squishing anything.

"Squishing" was a bit of a colloquialism. For the vast majority of problems I've seen, fitting the grammar into the prompt, choosing a very very simple grammar, and filtering non-confirming responses is a pretty good approach. Your success rate is decently high (i.e., not much more expensive than sampling-based approaches), you get something sufficiently machine-readable to fit into your pipeline, and you have a sampling distribution matching the underlying LLM. For sampling-based approaches though, you skew the result distribution in the way described above, which I called "squishing".

> Part of structured generation is specifying exactly the range you want for the field, so I'm not sure how this issue arises. With structured generation you force the LLM to have at least some number of characters, so this seems like an argument in favor of structured generation rather than against. If unstructured LLMs want to "close the field", then structured can force it to remain open.

If you can enumerate all the failure modes (or likely failure modes), absolutely. My biggest counter-arguments are:

1. That's hard to enumerate in general (hence why we have LLMs instead of grammar-rule-machines).

2. Even for the ellipses example, if you want to use one of those structured json libraries, how hard is it to require json and also require that "certain" strings (over X length, with certain corresponding keys, ...) can't be ended as `..."`? I haven't done it, but I have enough programming experience to be pretty sure it'll be a pain in the ass unless you want to fork the library for your use-case or re-write most of its json grammar.

3. Suppose you get that constraint (running example: certain strings can't end via `..."`) into the grammar; what exactly does that do? Those algorithms are greedy, and when the model first inserts a mistaken period, it's already made a mistake. Periods are allowed in strings though, so the grammar continues. The model, having diverged from the text it's transcribing and inserted a period, inserts another period because it's obviously creating an ellipsis. This is still valid in the grammar, so text generation continues. If your grammar just banned ellipses, you'd get some other nonsense character or an end-of-string quotation at this point, and if you banned string-ending ellipses you'd almost certainly get another period, at least one nonsense character, and then an end-of-string quotation. Despite the fact that you banned the bad behavior, in all the cases where you would have seen it without structured generation, you still get broken outputs with structured generation and don't know that they're broken. Contrast that with letting the string complete and retrying if it ends as `..."`. Every time the model mistakenly adds a period trying to end the string, that is explicitly caught and tracked. It's still "structured", but the result distribution is different and better.
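
For the running example, the post-hoc check is a one-liner in ordinary code and can slot into a plain retry loop like the one sketched earlier in this thread (the regex is illustrative and would also flag legitimate trailing ellipses):

    import re

    def looks_truncated(raw_json: str) -> bool:
        # Flag any string field the model closed immediately after an ellipsis,
        # i.e. the `..."` pattern discussed above. Retry when this returns True.
        return re.search(r'\.\.\."', raw_json) is not None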


Great summary - thanks. And definitely closer to the LLM api surface we need to really start to use these things.

There’s definitely a danger with this kind of code that the first thing this is going to generate to complete the “1.” prompt will be something like “ Truck; 2. Sedan; 3. Minivan”, though.


You generally have to tell the LLM what you want it to generate and then enforce it, i.e., the LLM has to be somewhat aligned with the constraints. Otherwise, for example, if you ask for JSON it will keep generating (legal) whitespace, or if you ask for C code it will say:

  int sureICanHelpYouHereIsAnExampleOfTheCodeYouWereAskingFor;
In this case however, you can just do:

  await aici.gen_text(regex=r"[a-zA-Z\n]+", stop_at="\n")
Also note that there is still a lot of work to figure out how it's easiest for the programmer of an LLM-enabled app to express these things - AICI is meant to make it simple to implement different surface syntaxes.


Good point. I believe it could be solved with backtracking, just like it is done in compilers/lexers.


Isn't that also similar to the (formerly?) Microsoft Guidance project? https://github.com/guidance-ai/guidance


We believe Guidance can run on top of AICI (we're working on an efficient Earley parser for that [0], together with the local Guidance folks). AICI is generally lower level (though our sample controllers are at a similar level to Guidance).

[0] https://github.com/microsoft/aici/blob/main/controllers/aici...


How does it compare to GBNF grammars or LMQL?


Each of these can be implemented on top of AICI. AICI lets you run an arbitrary (Wasm) program to determine which tokens are allowed next - this is strictly more expressive than CFGs.

In addition to constraining output, AICI also lets you fork the generation programmatically as well as dynamically edit the prompt which is important in some scenarios.
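
As a rough sketch of what forking could look like in the pyctrl style shown upthread (the fork primitive's name and signature here are illustrative, not confirmed API):

    import aici

    async def two_branches():
        await aici.FixedTokens("Summarize the incident report")
        # Hypothetical forking primitive: split generation into two branches
        # that continue independently; name/signature assumed, not confirmed.
        branch = await aici.fork(2)
        if branch == 0:
            await aici.FixedTokens(" in one sentence:\n")
        else:
            await aici.FixedTokens(" as three bullet points:\n")
        await aici.gen_text(stop_at="\n\n")

    aici.start(two_branches())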

The idea is that several LLM inference systems implement AICI as an interface, and Guidance, LMQL, etc. implement an AICI backend, and then everyone wins.


Is the benefit only that WASM (and things that target it) is used as the constraint language?


There are two parts to AICI - one is that it uses Wasm and the other is the specific set of operations on the LLM that it exposes (which is why we call it "interface").

Having a common interface (provided it is adopted) lets you re-use the same controller over multiple LLM infra stacks (and different models), or conversely use different controllers on top of your own brand-new LLM infra stack (or model).

Using Wasm simplifies your deployment story in the cloud (sandboxing); however, the AICI runtime communicates with controllers as separate processes, so if the controller is trusted it could in principle be run as native code.


Ah, that clears it up, thank you!


Although don't you still have problems with accurately anticipating the output for control-flow purposes? In this example, stopping at a new line. I guess that should work out fine here, although if I'm understanding this paradigm correctly, to help ensure it you'd want to have something like await aici.FixedTokens("Here is a bullet point list:\n\n") before entering the for-loop.


And now I'm waiting for the "LLM" monad.


Wasm is monadic.


could you eli5 how wasm being monadic is helpful / and how an llm monad would be too?


It’s just a useful way to communicate about it, giving it a name. If you’re knee-deep in OOP land you can just call it a “builder”


I mean a lawful(ish) monadic type/api, i.e. `LLM a` in FP-land. Otherwise, every sequential machine/program can be said to be monadic.


> That Python code is a really elegant piece of API design.

        await aici.FixedTokens("\n")

        # Store the tokens generated in a result variable
        aici.set_var("result", marker.text_since())
Not sure I agree.


llama.cpp is orders of magnitude easier. Rather than controlling token by token, with an imperative statement for each, we create a static grammar to describe, e.g., a JSON schema.
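
For reference, this is roughly what the static-grammar route looks like through the llama-cpp-python bindings (the GBNF text and model path are placeholders, and the argument names are from memory, so double-check against the bindings' docs):

    from llama_cpp import Llama, LlamaGrammar

    # A tiny GBNF grammar: a JSON object with a single string field.
    GRAMMAR = r'''
    root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
    string ::= "\"" [a-zA-Z0-9 ]* "\""
    ws     ::= [ \t\n]*
    '''

    llm = Llama(model_path="model.gguf")                 # placeholder path
    out = llm("Name one popular vehicle as JSON: ",
              grammar=LlamaGrammar.from_string(GRAMMAR),
              max_tokens=64)
    print(out["choices"][0]["text"])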

I'm honestly unsure what this offers over that, especially because I'm one of 3 groups with a WASM llama.cpp, and you can take it from me, you don't want to use it. (~3 tokens/sec with a 3B model on an MBP M2 Max/Ultra/whatever they call the top of the line for the MBP. About 2% of the perf of Metal, and I'd bet 10% of running on CPU without WASM. And there's no improvement in sight.)


I find these imperative statements much less intimidating than llama.cpp grammars - the Python code I copied here looks a lot more approachable to me than the gbnf syntax from https://raw.githubusercontent.com/ggerganov/llama.cpp/master...

I don't think the key idea here is to run llama.cpp itself in WASM - it's to run LLMs in native code, but have fast custom-written code from end-users that can help pick the next token. WASM is a neat mechanism for that because many different languages can use it as a compile target, and it comes with a robust sandbox by default.


It's only the controller that runs in Wasm, not the inference.

The pyctrl is just a sample controller; you can write a controller that takes any kind of grammar (e.g., a yacc grammar [0] - the Python code in that example is only used for gluing).

Llama.cpp grammars were quite slow in my testing (20ms per token or so, compared to 2ms for the yacc grammar referenced above).

[0] https://github.com/microsoft/aici/blob/main/controllers/pyct...


This is a great generic idea.

It's also possible to wrap it in something user-friendly, à la [0].

[0] https://github.com/ollama/ollama/issues/3019


What you describe there is a great example of a custom controller - it could be implemented on top of pyctrl, jsctrl, or just natively in Rust.


It also took me a while to figure it out, until I got to the example.

I am definitely going to play with this.


Awesome. I wonder if you could use this with a game engine or something. Maybe the aici module could render a Nethack screen. And perhaps automatically reject incorrect command keys in context (if integrated deeply enough).

Is it possible to combine this with some kind of reinforcement training system?


I'm really excited for LLMs in gaming, like an Ender's Game / Homeworld type game where you can shout orders and the units scramble around, or Stellaris where you have actual discussions with the other factions. Local and reliable output enables that sort of stuff, though the perf tradeoff of LLM vs game rendering might be hard to deal with.


Probably worth mentioning Deepmind's SIMA if you haven't seen it yet [0]. It definitely pushes the state of the art for agent behavior in a typical game environment.

Perhaps in the future we won't code NPCs any differently from player characters: they'll have access to the full action space, with long-term memory and natural language prompts guiding their behavior.

[0]: https://deepmind.google/discover/blog/sima-generalist-ai-age...


The easiest thing would be to use supervised finetuning on the LLMs you're trying to control (most open-source LLMs have some sort of off-the-shelf system for finetuning), combined with this system to control the output. I suppose there's nothing stopping you from writing an RL training system to alter the model weights other than needing to write a bunch of code, though... Maybe LlamaGym (https://github.com/KhoomeiK/LlamaGym/tree/main) could reduce the amount of code you need?


Does it support constrained generation during training?

This is what we need for the large language models I am training for health care use cases.

For example, constraining LLM output is currently done by masking, and having this Rust-based library would enable novel ways to train LLMs.
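
As a rough sketch of that masking idea (assuming a PyTorch-style training loop; names and shapes are illustrative), the constraint amounts to renormalizing the softmax over the legal tokens before taking the loss:

    import torch
    import torch.nn.functional as F

    def constrained_loss(logits, targets, allowed_mask):
        # logits:       (batch, seq, vocab) raw model outputs
        # targets:      (batch, seq) gold token ids, assumed grammar-valid
        # allowed_mask: (batch, seq, vocab) True where the constraint permits a token
        # Setting disallowed logits to -inf renormalizes the softmax over the
        # legal set, so training only competes valid tokens against each other.
        masked = logits.masked_fill(~allowed_mask, float("-inf"))
        return F.cross_entropy(masked.flatten(0, 1), targets.flatten())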

Relevant papers:

https://github.com/epfl-dlab/transformers-CFG

https://neurips.cc/virtual/2023/poster/70782


It's definitely a very exciting direction, which we have not explored at all!



