Guidance: A guidance language for controlling large language models (github.com/guidance-ai)
103 points by bx376 on Sept 16, 2023 | 41 comments


The thing I most want from this project is a technical explanation of what it's actually doing for me and how it works.

I dug into this the other day, and just about figured out how the old text-davinci-003 version works.

When it runs against a text completion model (like text-davinci-003) the trick seems to be that it breaks your overall Mustache-templated program up into a sequence of prompts.

These are executed one at a time. Some of them will be open ended, but some of them will include restrictions based on the rules that you laid out.

So you might have a completion prompt that asks for a maximum of 1 token and uses the logit_bias argument to ensure that the returned value can only come from a specific set of tokens. That's how you would satisfy a rule in the program that says "the next output should be just 'true' or 'false'", for example.
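
Roughly, as a sketch against the legacy completions API (the token ids below are made up; you'd look the real ones up with the tokenizer first):

    import openai

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt="Is the sky blue? Answer:",
        max_tokens=1,
        # bias sampling so only the "true" / "false" tokens can win
        # (2081 and 3991 are illustrative ids, not the real ones)
        logit_bias={2081: 100, 3991: 100},
    )
    print(response["choices"][0]["text"])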

What I don't yet understand is how it works against non-completion models. There are open issues complaining about broken examples using it with gpt-3.5-turbo for example.

And how does it work with models other than the OpenAI ones?


I dug into this a while back; IIRC (handwaving a bit) it comes down to "pausing" template rendering and calling the LLM with all of the content generated so far. https://github.com/guidance-ai/guidance/blob/main/guidance/l...

This is how we implemented it anyhow, with some more parameters to control how that all works (and the LLM params) at each "pause" point. The _neat_ part for us was that a template helper could make use of the partially generated content. Hadn't thought about that before for a templating engine, but was trivial to implement in the end


I took a stab at making something[1] like guidance - I'm not sure exactly how guidance does it (and I'm also really curious how it would work with chat APIs), but here's how my solution works.

Each expression becomes a new inference request, so it's not a single inference pass. Because each subsequent pass includes the previously inferenced text, the LLM ends up doing a lot of prefill and less decode. You only decode as much as you actually generate; the repeated passes only cost more in prefill (which tends to be much faster in tok/s).

To work with chat-tuned instruction models, you can basically still treat them as completion models. I provide the previously completed inference text as a partially completed assistant response, e.g. with Llama 2 it goes after [/INST]. You can add a bit of instruction for each inference expression, which gets added to the [INST]. This approach lets you start off the inference with `{ "someField": "` for example, to guarantee (at least the start of) a JSON response, and allows you to add a little bit of instruction or context just for that field.
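
In rough pseudo-Python, that prompt assembly looks something like this (build_prompt is a made-up helper; the tags follow the Llama 2 chat format):

    def build_prompt(instruction: str, partial_response: str) -> str:
        # seed the assistant turn with text the model must continue from
        return f"<s>[INST] {instruction} [/INST] {partial_response}"

    prompt = build_prompt(
        'Describe the user as JSON with a "someField" key.',
        '{ "someField": "',  # the model picks up mid-JSON from here
    )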

I didn't even try with the OpenAI APIs since afaict you can't provide a partial assistant response for the model to continue from. Even if you were to request a single token at a time and use logit_bias for biased sampling, I don't see how you could get it to continue a partially completed inference.

[1] https://github.com/gsuuon/ad-llama


Is this Microsoft Guidance? It looks like it is, and they spun it out.

I find guidance to be fantastic for doing complicated prompting. I haven't used the output-'controlling' feature as much as chain prompting: ask it to come up with answers to a prompt N times, then discuss the pros and cons of each answer, then make a new answer based on the best parts of the output. Stuff like that.
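
From memory of the 0.0.x Handlebars-style syntax, that kind of chain looks roughly like this (hedged; details may not match the current README exactly):

    import guidance

    guidance.llm = guidance.llms.OpenAI("text-davinci-003")

    program = guidance("""Question: {{question}}
    {{#geneach 'answers' num_iterations=3}}Candidate answer: {{gen 'this'}}
    {{/geneach}}Pros and cons of each answer: {{gen 'critique'}}
    Final answer combining the best parts: {{gen 'final'}}""")

    result = program(question="How should we cache LLM calls?")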


I’ve found using a JSON schema and function calling, as described in this blog post, to be just as effective as, and less opaque than, this library:

https://blog.simonfarshid.com/native-json-output-from-gpt-4

(it works perfectly with GPT-3.5 as well)
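
The gist of that approach, as a sketch against the 2023-era OpenAI API (the function name and schema here are illustrative):

    import openai

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Extract name and age: 'Ann is 34.'"}],
        functions=[{
            "name": "record_person",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        }],
        # force the function call so the reply is guaranteed JSON arguments
        function_call={"name": "record_person"},
    )
    print(response.choices[0].message.function_call.arguments)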


I've found that the approach of templating large prompts makes programs hard to read. The attractive part is that control flow isn't separated from the prompt, as it is in langchain, which lets you write prompts as classical programs. But the syntax remains unintuitive for large programs.


Logit-bias guidance goes a long way -- structured LLM output for regexes, context-free grammars, categorization, and typed construction. I'm working on a hosted, model-agnostic version of this with thiggle.

[0] https://thiggle.com


Can anyone comment on how well this does at coercing json output vs OpenAI function calling?


This is just a different way to write prompts: it allows some interleaving of calls to the API so you can build things up and write a conversation as a single file, with conventions around the text to send to the LLM.

I would not expect it to make a difference in your current applications. Getting JSON is all about the model, training, and prompt, in that order

If you are looking for low-hanging fruit to improve your JSON responses from LLMs, fine-tuning will likely get you the most bang for your buck. Start from a coding model like codellama, code-bison, or starcoder


For local models it forces valid JSON structure: the formatting tokens are produced by code rather than generated by the LLM.


sounds like post-processing made out to be something more?

everyone is doing this, it's just part of the pipeline, certainly nothing innovative on that front in guidance


It updates token logits (probabilities) after every token before sampling. I don't think this is very common yet.
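
In Hugging Face transformers terms this is a logits processor, roughly like the following sketch (the allow-list logic is illustrative):

    import torch
    from transformers import LogitsProcessor

    class AllowListProcessor(LogitsProcessor):
        """Mask out every token id except an allowed set."""
        def __init__(self, allowed_ids):
            self.allowed_ids = allowed_ids

        def __call__(self, input_ids, scores):
            # runs once per generated token, before sampling
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed_ids] = 0.0
            return scores + mask

You'd pass it via model.generate(..., logits_processor=LogitsProcessorList([...])).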


Right, there are many folks (dozens of us!) yelling about logit processors and building them into various frameworks.

The most widely accessible form of this is probably BNF grammar biasing in llama.cpp: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
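
e.g. with llama-cpp-python (the model path is a placeholder):

    from llama_cpp import Llama, LlamaGrammar

    # a grammar that only permits the literal strings "true" or "false"
    grammar = LlamaGrammar.from_string('root ::= "true" | "false"')
    llm = Llama(model_path="./model.gguf")  # placeholder path
    out = llm("Is water wet? Answer: ", grammar=grammar, max_tokens=4)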


Still rare, but I wrote a whole paper last year about what happens when you use this functionality (a lot, including defeating any kind of RLHF!)

https://aclanthology.org/2022.cai-1.2.pdf


anecdotal counter evidence, I've seen multiple projects / papers manipulating the logits, it's a very common thing to think of doing now to improve performance (by eliminating bad options from consideration)


OpenAI function calling + a JSON schema is dead simple and has never failed for me, whereas I hit a bunch of errors with guidance when trying to do things like nested, repeating values.


Yeah, my problem with this is that you have to buy into their way of interacting with and calling an LLM. Seems more like Handcuffs than Guidance to me


I've been trying to figure out how projects like this, semantic kernel (also msft), and langchain add value. Is the paradigm sort of like a web framework? It reduces the boilerplate you need to write so you can focus on the business problem?

Is that needed in the LLM space yet? I'm just not convinced the abstraction pays for itself in reduced cognitive load, or at least not yet, but very happy to be convinced otherwise.


It lets you actually control the output structure and more or less guarantee the LLM is doing what you want. Plus it reliably extracts structured results.

It's obviously extremely valuable if you're doing anything with the LLM output other than displaying it as a block of text to the user, or if you care about the output format at all.


IMO Guidance is valuable; the underlying logic is sufficiently complex that I'm glad I didn't need to DIY it. The same goes for the faster Outlines project from Normal Computing.

LangChain: I found having a framework useful for ramping up people without prior LLM exposure, in an open-ended experimental space. The library covers many use cases and gets people thinking. But honestly their documentation is somewhat lacking for that purpose (stale text, shallow examples). Personally, coming from a search background, I was able to DIY semantic RAG in the time it took to figure out how to do the same thing in LangChain.


In my experience, they add cognitive load to working with LLMs, including when doing more than just calling an LLM, like RAG. But maybe others feel differently. I’m glad there’s variety.


Yea, in particular for this project, they have created a bespoke templating system.

You can get the same thing with Go text/templates by adding chat function(s) as a custom helper: https://github.com/hofstadter-io/hof/blob/_dev/lib/templates...

As a developer of these things, I don't get why they want to put so much effort into the mundane parts rather than focusing on the interesting parts. These things are mostly just the same as any other workflow or API call: https://github.com/hofstadter-io/hof/blob/_dev/flow/chat/cmd... (unless you get into the Python and, e.g., start messing with the logits or token probabilities)


The thing that’s bugging me about this ecosystem is that the library, although it augments, has to become the thing running the LLM; I can’t use guidance as a plug-in on some other LLM system.

I look forward to when we have something that can run any LLM without compatibility issues, can expose APIs etc and has a robust plugin or augmentation system.


`llm` might be the closest thing to that right now.

https://github.com/simonw/llm


Is this alive? Last release June 21

There are many projects like these that I'm tracking, but they all kinda cool off after the initial prototype and thus have many quirks and limitations

So far the only one I could reliably use was llama.cpp grammars, and those are fairly slow


> Is this alive? Last release June 21

How often does a project need to release to not be considered dead? It's only been 10 weeks, in the summer, at the peak of vacation time

Look at the most recent commits: they are setting up new governance, which likely took more than 10 weeks to work through the bureaucracy of Microsoft


LMQL seems to be alive and takes some of these concepts even further. It's the project of 1 or 2 PhD students at ETH Zürich so I'm hopeful they'll see it through.

I thought guidance was smart, but LMQL seems brilliant as it merges pythonic constructions with LLMs (I think it may be an outright superset of Python with LLM functionality?)

It's backed by a paper as well: https://arxiv.org/pdf/2212.06094


LMQL requires a user to learn a bespoke programming language. Not a good idea: no one really wants to have to learn a new programming language to work with one library or framework. You have to have a really compelling offering. With LLMs, the libraries and frameworks are a dime a dozen, so it's going to be a much bigger ask of your users


I see your point, but at the same time I'm looking for alternatives: guidance isn't really alive, and langchain is just... a lot of stuff (arguably bloat...), and I don't see any obvious, easy value in it like I see in lmql/guidance.


I'm hacking on a library (https://github.com/gsuuon/ad-llama) inspired by guidance, but in TS and for the browser. I think structured inference and controlled sampling are really good ways of getting consistent responses out of LLMs. It lets smaller models really punch above their weight.

I wonder what other folks are building on this sort of workflow? I've been playing around with it and trying to figure out interesting applications that weren't possible before.


I've seen this link pop up in various places now, but it seems like it's still mostly not being developed? Is there a reason it was posted today? Some new development in it?


They are changing the governance and contributors, maybe in prep to do something more or raise money? Every AI library seems to try that path these days

Somehow, the VCs and investors made us think it was cool to be working for them rather than our users


I've been using this library a lot; it's amazing. However, I noticed a very considerable degradation (time taken + generation quality) with versions > 0.0.58 when used with local LLMs.

I haven't taken the time to compare the different releases, but if anyone is having the same type of issues, I recommend downgrading, even if it might mean fewer features.


This seems to be just a clone of Microsoft Guidance.


This IS Microsoft Guidance, they seem to have spun off a separate GitHub organization for it.

https://github.com/microsoft/guidance redirects to https://github.com/guidance-ai/guidance now.


Simon in the house to clear the air!

Do you have similar capabilities in your LLM project?


Not yet. I've been investigating llama-cpp grammars recently with an eye to getting those working in LLM - maybe even using them to get an equivalent to OpenAI Functions working (which I'd like to include in LLM too).

Notes on grammars here: https://til.simonwillison.net/llms/llama-cpp-python-grammars


It's still unclear to me if that is the right direction or a scalable solution.

Seeing how far prompt engineering can get you on this front (specifying the grammar as a one/few-shot example does pretty well), it seems like something that could be better handled at training/fine-tuning time? My general feeling is that it inserts a specific and complicated cog, and probably has to be tweaked for each model (as they all have their little quirks)

Curious what you think having been working directly on these things?

From my experience, you can get an LLM to follow a "grammar" for pretty much anything without it being an actual grammar spec in one of the many formats. You can pretty much make it up. Here's an example of us getting CUE out of a model by "tricking" the LLM into generating JSON with less syntax (a subset of both CUE and JSON). Bonus: fewer quotes and commas meant fewer tokens. We turn this into JSON afterwards; it works surprisingly well.

https://github.com/hofstadter-io/hof/blob/_dev/flow/chat/pro...
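
Illustratively (not our actual code; assuming a flat "key: value" form with no braces, quotes, or commas), the conversion back to JSON is basically:

    import json

    def relaxed_to_json(text: str) -> str:
        # parse one "key: value" pair per line and rebuild a JSON object
        obj = {}
        for line in text.strip().splitlines():
            key, _, value = line.partition(":")
            value = value.strip()
            try:
                obj[key.strip()] = json.loads(value)  # numbers, booleans
            except json.JSONDecodeError:
                obj[key.strip()] = value  # bare strings get quoted by dumps
        return json.dumps(obj)

    print(relaxed_to_json("name: Ann\nage: 34"))  # {"name": "Ann", "age": 34}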


I don't trust most LLMs to reliably follow instructions to "only output JSON with no extra text". Llama 2 for example really isn't very good at following those kinds of instructions in my experience.

I really like how grammars offer a realistic path to getting completely dependable formatted output from these models.


Interesting, I hadn't thought about the problem where they don't follow the instruction "only output the JSON, do not add explanations or other text"

I haven't pushed on codellama2 much yet, but in my initial experiments it did not really output anything extra, and my prompt became a one-liner compared to the really long instructions I had to give chatgpt for controlling output. Shows how far you can get with a purpose-trained model

Fine-tuning is important for getting more consistent output; none of the smaller (open-source-sized) models are going to get there with just few-shot prompting. It sounds like grammar-based logit biasing is a low-cost/effort way to constrain output without a fine-tuning cycle. I can imagine they might be better together, but my hunch is that fine-tuning will still dominate the improvements and consistency. If you don't have the training data, that is a very good reason to use this technique too


My favourite joke about LLMs not following formatting instructions is this from Riley Goodside: https://twitter.com/goodside/status/1657396491676164096



