Show HN: LLMs can generate valid JSON 100% of the time (github.com/normal-computing)
854 points by remilouf 11 months ago | 303 comments
Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
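The index-building step can be sketched in miniature. Everything below (the tiny vocabulary, the token ids, the hand-rolled DFA for `[0-9]+`) is made up for illustration and is not Outlines' actual code:

```python
# Toy "vocabulary" of multi-character tokens, as a token LLM would have.
VOCAB = {0: "12", 1: "3", 2: "a1", 3: "4x", 4: "<eos>"}

# Hand-rolled DFA for the regex [0-9]+: state 0 = start, state 1 = accepting.
def dfa_step(state, char):
    return 1 if char.isdigit() else None  # None = dead state

def token_transition(state, token):
    # Feed a whole token through the DFA; None means this token can
    # never be a legal continuation from this state.
    for ch in token:
        state = dfa_step(state, ch)
        if state is None:
            return None
    return state

def build_index(states=(0, 1)):
    # One pass over the vocabulary per DFA state, done once at
    # initialization: state -> {token_id: next_state}.
    index = {}
    for s in states:
        index[s] = {}
        for tid, tok in VOCAB.items():
            if tok == "<eos>":
                if s == 1:  # end-of-sequence is only legal in accepting states
                    index[s][tid] = s
                continue
            nxt = token_transition(s, tok)
            if nxt is not None:
                index[s][tid] = nxt
    return index

print(build_index())
# {0: {0: 1, 1: 1}, 1: {0: 1, 1: 1, 4: 1}}
```

At generation time, the keys of `index[state]` are exactly the token ids left unmasked, and following the stored next-state is the FSM transition.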

Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.

From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
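For concreteness, this is the kind of Pydantic model those guarantees cover. The schema below is invented for illustration (not from the library's docs), exercising optionals, unions, nesting, and arrays:

```python
from typing import Optional, Union
from pydantic import BaseModel

class Weapon(BaseModel):
    name: str
    damage: int

class Character(BaseModel):
    name: str
    age: Optional[int] = None     # optional field
    weapons: list[Weapon]         # array of nested models
    alignment: Union[str, int]    # union type

# Output constrained to this schema is guaranteed to parse:
text = '{"name": "Aria", "weapons": [{"name": "bow", "damage": 7}], "alignment": "good"}'
c = Character.parse_raw(text)
print(c.weapons[0].damage)  # 7
```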

I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.

I look forward to feedback, bug reports, feature requests and discussions!

Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702

Mechanistically, I think this library takes the simple idea of masking part of the vocabulary space and steps in time efficiently. Great!

I am curious, however, for the ones who have played around with such libraries wrapping base LLMs with output structure: do base models like Llama2 work very well? My experience says "hell no!" and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

And even then, it seems very counter-intuitive to me that, given an instruction-tuned model, post-hoc masking of the state-space during generation just amounts to changing the generation distribution, and is potentially detrimental to instruction-tuning?

I'm not sure why you would want to use raw llama-2 though when there are a million super strong instruction fine-tuned versions of llama-2 on HF hub that would do the job a million times better? Like Stability-AI's Beluga-2. See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output token can and cannot be used.

Don't rely too much on automated benchmarks for LLMs. They are often gamed, made to overfit, and result in worse performance in the general case.

Human evaluation is the gold standard, and the Llama 2 paper gave significant evidence that Llama 2 70b chat is on par with, if not better than, ChatGPT for that metric, so I tend to stick to it unless there is good reason not to.

The problem with Llama 2 chat versions is that they have been RLHF-ed to death. You can't ask questions without getting a sermon of how your question may be inappropriate for this or that reason.

I think it's worse on the smaller models, but still present in the 70B one.

Apologies if you’d already seen this and were only trying to make a point, but you might like this article from a week or 2 ago that talks about how to run Llama 2 “uncensored” locally, and it seems to do a decent job of mitigating the sermons!

Article: https://ollama.ai/blog/run-llama2-uncensored-locally

Discussion: https://news.ycombinator.com/item?id=36973584

When you encounter "uncensored" in a llama model (1 or 2) what that means in that context is that the fine-tuning datasets used have had all refusals to respond removed. There's no way to uncensor the pre-trained model itself and fine-tuning only changes the style of the output.

For sure, that's a good reason for using the uncensored fine-tuned versions. There are other good reasons too like expanded context size, codegen, and story writing/rp. Just be careful of extraordinary benchmarks.

Btw, have you tried changing the default Llama 2 chat prompt? Meta tried to fine-tune it so that if you remove the safety part from the prompt, safety won't be applied[1]. Not sure how well it works myself, but worth a shot I guess

[1] can be found in the Llama 2 paper

> I'm not sure of why you would want to use raw llama-2

Sure. My concern was not specific to llama-2, and was only using it as a placeholder example of a decent pre-trained base model. Replace it with your favorite base model, which you want to use for guided generation. My question is more fundamental - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?

> About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output token can and cannot be used.

Mechanistically, yes. I am not arguing that. The whole point is to generate JSON that is "useful".

I'm quite impressed with Llama 2 13B - the more time I spend with it the more I think it might be genuinely useful for more than just playing around with local LLMs.

I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.

Even the 7B model is shockingly good! I've been hacking on a project also built on MLC (but the web runtime) and the completions I'm seeing from Llama 2 7B, just running on my laptop's browser, have been really impressive. There's a demo page here: https://ad-llama.vercel.app/

That demo is really cool!

What are your use cases?

The thing I really want to get working is retrieval augmented generation - so effectively answering questions based on a blob of context that I pass in, and being able to do good-enough summarization.

I haven't quite proved this to myself yet but I think it's going to work pretty well.

Not simonw, but I've been using Llama2-13B for search re-ranking very successfully.

search re-ranking?

Do a search, then re-order the results based on some criteria. Easy when the criteria are easy to code, less so when they aren't. But it turns out LLMs are pretty good at interpreting re-ranking instructions.
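A hedged sketch of what that can look like, with `llm` standing in for any text-completion function (the prompt wording and the stub reply are made up):

```python
def rerank(query, results, llm):
    # Ask the model to reorder search hits by relevance to the query.
    prompt = (
        f"Query: {query}\n"
        "Rank these results from most to least relevant. "
        "Answer with the numbers only, comma-separated:\n"
        + "\n".join(f"{i}. {r}" for i, r in enumerate(results))
    )
    order = [int(tok) for tok in llm(prompt).split(",")]
    return [results[i] for i in order]

# Stub LLM for illustration only:
fake_llm = lambda prompt: "2,0,1"
print(rerank("best pizza", ["doc a", "doc b", "doc c"], fake_llm))
# ['doc c', 'doc a', 'doc b']
```

A real version would need to handle malformed model replies (missing or duplicated indices), which is exactly where constrained generation helps.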

In our experience, at least for code generation, the experience has been that base models can be improved significantly by guiding token level generation.

In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.

Thanks for the reference, Lakshya. Looks very cool!

(Just thinking out loud next)

If you allow me to be a little imprecise, guided-generation is prompting "just-in-time" unlike the other kind of prompting where you provide all reference tokens "ahead-of-time". Now there's work [1] out there that shows that smaller models rely much more on prompting than larger models do, i.e. smaller models are more faithful to the tokens in the prompt than the larger models which just do whatever they were going to do anyways.

Your results seem very much in line with this kind of a qualitative result --- you show that CodeGen-350M outperforms CodeGen-6B, and CodeGen-6B outperforms text-davinci-003 using MGD. Smaller models perhaps respond more strongly to certain kinds of prompting strategies than larger models do.

[1]: https://arxiv.org/pdf/2307.13702.pdf

It is an interesting paper. Any idea when the code/data will be released? It appears it has been almost 2 months since the paper was submitted, but the link given leads to a random bing page :-(

> ...given an instruction-tuned model, post-hoc masking of the state-space during generation then amounts to just changing the generation distribution...

Isn't that what we did with test driven development?

The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?

Yes. And if that human was smart and knowledgeable they would use property-based testing to automatically generate test inputs. Most libraries make it trivial to do for custom data types and can even reduce the failing test case to a minimal-size input. I have been using this since 2008 and it was around before that.

I think what I am saying is tangential to TDD. I am not really even concerned about the ability of LLM to function as desired, and its verification.

I was rather concerned about a broader fundamental question - how does post-hoc guided generation interfere with the potential benefits of instruction-tuning?

>you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

The instruction tuning part is "trivial"...it's the dealing with edge cases part that gets me.

With classic code, edge cases are, well, insignificant edge cases. With LLMs you never know what will make one go off on a tangent, and the parsing code needs to deal with that chaos.

Or put differently the % of cases that are edge cases seems to have gone up dramatically

I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten.

But it's still probabilistic, and nine times out of ten isn't good enough.

Occasionally it will hallucinate responses like this:

{"key1": "value1", "key2": "value2" for i in range(n)}

Re-prompting with the parsing error message is usually enough to get it on the second try.

But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.

Re-prompting for escaping errors still yields a ~50% success rate.
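That retry loop can be sketched like this, with `call_llm` standing in for any completion API (the stub replies below are invented for illustration):

```python
import json

def generate_json(prompt, call_llm, max_retries=3):
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Re-prompt with the parser's error message appended.
            prompt += f"\nYour last reply was invalid JSON ({err}). Try again."
    raise ValueError("no valid JSON after retries")

# Stub LLM that hallucinates once, then succeeds:
replies = iter(['{"key1": "value1" for i in range(n)}', '{"key1": "value1"}'])
print(generate_json("Return JSON.", lambda p: next(replies)))
# {'key1': 'value1'}
```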

That re-prompting on error trick is what this new Microsoft library does, too: https://github.com/microsoft/TypeChat

Here's their prompt for that: https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...

I think the approach using grammars (seen here, but also in things like https://github.com/ggerganov/llama.cpp/pull/1773 ) is a much more elegant solution.

A "repair prompt" instead of rewinding and starting back from the error seems like the wrong choice, and might only make sense with how payment for OpenAI API usage currently works.

I've had more luck with getting it to output XML as (1) You can imbue XML with actual language/meaning (which LLMs adore) and (2) parsers can be made to be more forgiving. I get why people want to make JSON, but to me it's a bit like trying to get a cat to swim - you might eventually succeed, but it's not their natural inclination.

I've had the same experience as well. I suspect it's due to the large presence of HTML in the training data, as part of codebases and online content.

How do you imbue XML with meaning?

XML elements themselves: their naming, their attributes, comments, indentation. There's more opportunity at every level of the hierarchy to demarcate and establish meaning. Having closing tags as well, I've found, is a massive boon; LLMs can better understand what "finishing" looks like if it's delimited in a semantic way - with a name.
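For example, something along these lines (an invented snippet, just to show where the semantic hooks live):

```xml
<!-- element names, attributes, and comments all carry meaning -->
<characterSheet>
  <nameBasedOnSetting era="medieval">...</nameBasedOnSetting>
  <!-- one short sentence only -->
  <oneSentenceBackstory>...</oneSentenceBackstory>
</characterSheet>
```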

The same works for JSON. Naming JSON keys well works nicely for steering the output, and you can comment in your definitions (by defining them in a JSON Schema, or inserting placeholder text like `"someKeyWithClarifyingDetails": <some detailed instruction>`)

I'm actually partial to CSV these days though, it can really cut down on response times just not needing to return all the extra tokens for JSON/XML delimiters

Ostensibly, yeah, JSON should be able to encapsulate most of that semantic stuff, but having replaced an XML schema in the system prompt with GPT's function-calling API, I've been very unimpressed. It feels much less capable. I would have to provide a lot more clarifying prompts to make it more capable. I think I will, for now, bias toward using schemas that are closest to prose.

Yikes. This makes me think that JSON's stubborn mistake of not allowing comments is yet another "Billion-Dollar Mistake", since it's way too late to just change the standard to allow comments, update all the JSON content on the internet to use comments, and retrain all the LLMs to understand comments.

Great point about CSVs! But using placeholder keys for JSON comments is untenable, and using a schema instead of inline comments is clumsy and indirect. Of course JSON schemas are quite useful in certain situations, but LLMs would get a lot more meaning out of casual common JSON if it just allowed comments, and it would also greatly benefit humans.

Between JavaScript's and JSON's mistakes, that's at least <DoctorEvilVoice>THREE BILLION DOLLARS!!!</DoctorEvilVoice> ;)


>Speaking at a software conference in 2009, Tony Hoare apologized for inventing the null reference:

>"I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years." -Tony Hoare


>"My favorite is always the Billion-Dollar Mistake of having null in the language. And since JavaScript has both null and undefined, it's the Two-Billion-Dollar Mistake." -Anders Hejlsberg

>"It is by far the most problematic part of language design. And it's a single value that -- ha ha ha ha -- that if only that wasn't there, imagine all the problems we wouldn't have, right? If type systems were designed that way. And some type systems are, and some type systems are getting there, but boy, trying to retrofit that on top of a type system that has null in the first place is quite an undertaking." -Anders Hejlsberg

I'm not saying use placeholder keys: the actual keys themselves serve as guidance.

Naming a key "nameBasedOnLocationIGaveYou" instead of "name", or "oneSentenceSummary" vs "summary", results in a meaningful difference.

You can even use that for formatted single-response chain of thought, like {"listOfStuff":[...], "whatDoTheyHaveInCommon": "", "whichOneIsMostImportant": ""}

Also remember, the LLM doesn't need valid JSON: I just straight up insert comments in the JSON in a non-compliant way for some of my prompts, GPT-4 and Claude are all smart enough to not hallucinate comments back at you. 3.5 might be pushing it if temp is too high (although even the nerfed API logit bias should fix that now that I think about it)

And sometimes to save tokens I describe a JSON object without using JSON: just structure it in neatly formatted markdown and even 3.5 can follow along

Oh, I see! I misunderstood that you meant using dummy keys to hold comments in their values, which some people have suggested as a work-around for there not being any comments in JSON.

With ChatGPT function calling I get valid JSON 100% of the time from GPT-4 unless I have made some error in prompting.

The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about the texts, but then one of the texts is blank, it will be difficult to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.

As long as you've accounted for these failure modes, it works flawlessly.
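As a sketch, such an escape hatch in a function-calling schema might look like this (the function and field names here are made up):

```python
# Hypothetical function-calling schema with an explicit escape hatch,
# so the model has a "right answer" even for blank or unusable input.
schema = {
    "name": "extract_article_facts",
    "parameters": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "summary": {"type": "string"},
            "textIsMissing": {
                "type": "boolean",
                "description": "Set to true if the input text is blank.",
            },
        },
        "required": ["textIsMissing"],
    },
}
```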

GPT-4 is amazing, but the upside of smaller models is much lower cost. I get basically 100% accuracy on JSON modeling with GPT-4 with function calling too, but I will say that gpt-3.5-turbo with function calling is somewhat less accurate — it usually generates valid JSON in terms of JSON.parse not exploding, but not necessarily JSON following the schema I passed in (although it's surprisingly good, maybe ~90% accurate?). I use 3.5-turbo a decent amount in API calls because it's just a lot cheaper, and performs well enough even if it's not gpt-4 level.

I haven't gotten a chance to earnestly use the smaller Llama models yet in more than small prototypes (although I'm building a 4090-based system to learn more about finetuning them), but the little amount of experimenting I've done with them makes me think they need a decent amount of help with generating consistently-valid JSON matching some schema out of the box. This is a pretty neat tool to use for them, since it doesn't require finetuning runs, it just masks logits.

claude-1.2-instant came out last week and is doing extremely well at following schemas.

I'd say it's reached 3.5-turbo level, but with the format-following skills of GPT-4, which is powerful once you give it chain-of-thought.

The premise of function calling is great, but in my experience (at least on GPT-3.5, haven't tried it with GPT-4 yet) it seems to generate wildly different, and less useful results, for the same prompt.

GPT-3.5 is pretty much useless for reliable NLP work unless you give it a VERY narrowly circumscribed task.

That's really the major breakthrough of GPT-4, in my mind, and the reason we are absolutely going to see an explosion of AI-boosted productivity over the next few years, even if foundation LLM advancements stopped cold right now. A vast ocean of mundane white collar work is waiting to be automated.

You can set the temperature to 0 and get the same output each time for the same input

In my experience (with GPT-4 at least), a temperature of 0 does not result in deterministic output. It's more consistent but outputs do still vary for the same input. I feel like temperature is a bit more like "how creative should the model be?"

One theory is it is caused by its Sparse MoE (Mixture of Experts) architecture [1]:

> The GPT-4 API is hosted with a backend that does batched inference. Although some of the randomness may be explained by other factors, the vast majority of non-determinism in the API is explainable by its Sparse MoE architecture failing to enforce per-sequence determinism.

[1] https://152334h.github.io/blog/non-determinism-in-gpt-4/

I should probably re-test it, but I think it wasn't the temperature. The results were unusually useless.

Meh... I asked GPT4 to return some sample PHP code inside of a random JSON. It failed the JSON linter on the very first try. I couldn't get it to pass validation despite many retries, e.g. follow-up corrections. Not a single time did it generate 100% valid JSON; I eventually gave up.

if you think that's bad, try to get it to generate Inform 7 games—Inform's natural-English-ish syntax completely throws all LLMs for a loop, consistently. it generates code that looks possibly correct (to an Inform newbie at least), but fails to compile far more often than not. I find this super interesting.

This worked with chatGPT: create a sample hello world in php

store that code in a JSON object

code: { "php_code": "<?php echo 'Hello, World!'; ?>" }

I see two major advantages to grammar-constrained generation:

1. It consumes fewer tokens, no need to add too many examples into the prompt.

2. It suffers less from the forgetting issue.

Another minor advantage is that you can control precisely where your desired output begins.

But overall, those are nice perks, not too substantial IMO.

What about reprompting with a different temperature value?

If this works, how do you select the optimal value? Maybe you could train a model that excels at the task of querying GPT-4 for valid JSON.

I wonder if the next iteration of OpenAI features is something like:

right now you can inject prompts that the LLM takes into consideration before the output

I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"

It’s starting to feel like LLMs are to “classical” software engineering what quantum physics was to classical physics

How so? I’m not quite following the analogy.

Just guessing what was meant, but quantum physics in some sense tries all possible paths before an outcome is selected.

The problem with that is that without a quantum computer, or without some sort of filtering, that process can take up to infinite time.

Oh it was just a glib way of moaning about non-determinism making its way into software engineering. Much like how physicists had to make peace with the probabilistic nature of quantum physics.

Why wait for OpenAI?

>I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten

But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.

Can you please explain what batching is? Any pointers?

This is what we do, but for GPT-3.5. And it doesn't need to be system messages either. We even have it emitting only JSON in a specific structure (except for when it fails to produce an output altogether). This is without the function calling model.

It took some iterations but I've managed to get the OpenAI API to give me valid JSON 100% of the time now (based on my testing). I think I put in the prompt to never use newlines because it was causing issues lol.

Yeah same thing. I have done the same with GPT-3.5. Simply ask it to output using provided schema only and give a few examples. Always outputs in provided json format

What about using ChatGPT’s new function calling mechanism?

That returns broken JSON a lot of the time too

A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?

As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".

The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.
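The distortion can be computed exactly for the toy example above (a from-scratch simulation of both sampling schemes, not the library's code):

```python
from fractions import Fraction

# Toy model: four equally likely outputs, generated character by
# character; the "grammar" requires a space somewhere.
OUTPUTS = ["hello world", "food", "hello", "good day"]
EOS = "<eos>"

def grammar_ok(s):
    return " " in s

def can_become_valid(prefix):
    return any(o.startswith(prefix) and grammar_ok(o) for o in OUTPUTS)

def next_symbol_probs(prefix):
    # Base LM: next char of any matching output, or EOS if an output
    # ends exactly here; probabilities proportional to match counts.
    counts = {}
    for o in OUTPUTS:
        if o.startswith(prefix):
            sym = EOS if o == prefix else o[len(prefix)]
            counts[sym] = counts.get(sym, 0) + 1
    total = sum(counts.values())
    return {s: Fraction(c, total) for s, c in counts.items()}

def rejection_dist():
    # Sample whole outputs and keep only grammar-valid ones.
    kept = [o for o in OUTPUTS if grammar_ok(o)]
    return {o: Fraction(1, len(kept)) for o in kept}

def masked_dist():
    # Greedy masking: at each step drop symbols that cannot lead to a
    # valid output (and EOS while the grammar is unsatisfied), renormalize.
    result = {}
    def walk(prefix, prob):
        allowed = {}
        for sym, p in next_symbol_probs(prefix).items():
            if sym == EOS:
                if grammar_ok(prefix):
                    allowed[sym] = p
            elif can_become_valid(prefix + sym):
                allowed[sym] = p
        z = sum(allowed.values())
        for sym, p in allowed.items():
            if sym == EOS:
                result[prefix] = result.get(prefix, 0) + prob * p / z
            else:
                walk(prefix + sym, prob * p / z)
    walk("", Fraction(1))
    return result

print(rejection_dist())  # 'hello world' and 'good day' each 1/2
print(masked_dist())     # 'hello world' 2/3, 'good day' 1/3
```

The 2/3 vs 1/3 split appears because the probability mass of "hello" (which can never satisfy the grammar once EOS is masked) gets funneled into "hello world" instead of being discarded.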

As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.

In this case (multiple choice generation), if one of the possible outputs does not match the regex, you can just exclude it from generation.

I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless", which might really cause a problem. But no luck. Anyone have any ideas? This could potentially be an interesting research question.

An example from an earlier comment of mine on a different thread (assuming I've understood correctly):

> let's say we had a grammar that had a key "healthy" with values "very_unhealthy" or "moderately_healthy." For broccoli, the LLM might intend to say "very_healthy" and choose "very" but then be pigeonholed into saying "very_unhealthy" because it's the only valid completion according to the grammar.

That said, you can use beam search to more or less solve this problem by evaluating the joint probability of all tokens in each branch of the grammar and picking the one with the highest probability (you might need some more nuance for free-form strings where the LLM can do whatever it wants and be "valid").

This is a concern of mine, as well as limiting the amount that an LLM can talk through a problem - sometimes to nothing. Getting them to work through things IMO dramatically improves their output.

My gut feeling is that taking the output and, if it's broken, then starting to fix it would have a better result - you could even then completely limit the output to only valid JSON. For your example, if it wrote "very_healthy" and was given an error message explaining that this wasn't an option and it had to choose from "very_unhealthy" or "moderately_healthy", I would expect a halfway decent model to pick "moderately_healthy".

This has the benefit of allowing you to use a more powerful model for reasoning (like GPT4) and a local model where you can do this kind of token probability manipulation for just fixing the data.

The multiple choice example was just for tractable computations and illustrative purposes. Pretend the LLM has characters===tokens and is doing autoregressive probability prediction as per usual -- "f"-25%, "h"-50%, "g"-25% to start with, and then appropriate probabilities thereafter to yield that multiple-choice example (plus an <end-of-string> token).

> I am trying to think of an example where "answer prefix might have been extremely unlikely to yield a valid response, but the technique ( ... ) constructs a valid response from it regardless", which might really cause a problem. But to no luck. Anyone has any idea? This could potentially be an interesting research question.

At one point in the past ChatGPT (at a model probability layer, not just because of the context window issue) was prone to truncating long JSON responses, and if that happened in a long string field then you'd see the observed behavior. An example application:

(-) You're asking the LLM to turn some written podcast description into something machine-readable. You chunk the input, feed each chunk into the model (somehow; ignore the details; they're not important), and turn paragraphs into {speaker_name: str, timestamp: str, content: str} blobs.

(1) The LLM is prone to turning long paragraphs into `{"content": "the beginning of the content...` patterns, using ellipses to indicate that there's more to that JSON object.

(2) If you actually retry till the LLM succeeds, it's leaps and bounds more likely to end that string with a quotation mark if the string has all the original input. I.e., output like `{"content": "the beginning of the content..."}` is comparatively rare.

(3) The article's technique, however, always morphs those truncated JSON blobs into valid JSON. Since the ellipsis is _valid_ at that point (a sub-string), instead of the vast majority of inputs failing you instead end up with the vast majority succeeding and containing an incorrect ellipsis sub-string.

In general, the LLM does autoregressive completions. Imagine two prefixes P1 and P2, each of which can be completed by classes of data so that P1{G1} adheres to the grammar, P1{F1} fails to adhere to the grammar, P2{G2} succeeds, and P2{F2} fails. With retry-till-passing-grammar the weighted probabilities are:

P1{G1}: Chance[P1] Chance[G1 | P1]

P2{G2}: Chance[P2] Chance[G2 | P2]

Whereas the weighted probabilities produced by the technique are:

P1{G1}: Chance[P1]

P2{G2}: Chance[P2]

In both cases you'd need to divide by the total probability, but the convolution by conditionals is both important and notably absent. For very simple schemas like {sentiment: "positive"|"negative"|"neutral"} the results might potentially be similar, but nothing in the idea of a greedy token filter forces that constraint.

Relevant: llama.cpp implemented grammar-based sampling last month.

https://news.ycombinator.com/item?id=36819906 https://github.com/ggerganov/llama.cpp/pull/1773

We can extend our approach to grammar-based sampling, as explained in the paper linked above. Relevant PR: https://github.com/normal-computing/outlines/pull/178

Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step only requires a dictionary lookup (trading memory for speed). Sampling is just as fast as standard sampling.
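A minimal sketch of what "mask by dictionary lookup" means at sampling time (the names and the toy index are hypothetical, not llama.cpp's or Outlines' internals):

```python
import math

# Built once at initialization: FSM state -> set of allowed token ids.
INDEX = {0: {1, 2}, 1: {3}}

def masked_logits(logits, state):
    # Per step: one dictionary lookup, then mask everything else.
    allowed = INDEX[state]
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

print(masked_logits([0.5, 1.2, 0.3, 2.0], 0))
# [-inf, 1.2, 0.3, -inf]
```

After masking, softmax assigns zero probability to the disallowed tokens, so any standard sampler works unchanged.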

It should hopefully be a quick change to llama.cpp to add a mask per grammar state to bring it in line with your generation method; I don't think the two are incompatible, thankfully.

I do wonder how much you win here by masking the tokens? You still need to iterate along the output vector to apply the mask. Masking on the accelerator still requires filtering on the CPU side? Compared to running the language model, the cost of iterating over the edges in the grammar seems small.

Yes! This is closer to the approach I took in my port of llama.cpp's grammar support to PyTorch: https://github.com/Shopify/torch-grammar/blob/main/torch_gra... ... it generates a tensor mapping each PDA stack to a map of which tokens are acceptable from that state. It seems like a much better way to do it than looping over the sampled tokens on each turn.

We also had an implementation of grammar-driven guidance around the same time: https://github.com/normal-computing/outlines/pull/131. I imagine many others did as well, given all the papers we found on the subject. The point of this and our ongoing work is the availability of very low cost guidance, which was implemented a while ago for the regex case and expanded upon with JSON.

Thanks for building this. The mechanics are such an obvious idea that it's astounding that the first-party platforms haven't done this yet. I would be interested to see how this could be used for other tasks outside of JSON that require structured input.

> it's astounding that the first-party platforms haven't done this yet

I was under the impression LLM tech is currently in a breakneck arms race and that things are dramatically changing every few months. It could simply just be a consequence of limited developer resources. It would be "astounding" if decade-old tech were missing such a fundamental feature, but for AI tech in arms-race mode it seems reasonable that they are still missing QoL features.

I think they meant that you'd expect simpler/more obvious ideas to be implemented first.

Thanks! We have extended the approach to grammar-based sampling. We describe the approach in the paper linked above. The following PR is relevant: https://github.com/normal-computing/outlines/pull/178

Could this same approach be applied at training time? If the guidance does a lot of the syntactical heavy lifting, would that create the opportunity for the model to use its weights for something else? Essentially not bothering to reduce the error on things that the guidance will stomp on anyway.

Hi, the paper at https://arxiv.org/abs/2306.10763 titled "Guiding Language Models of Code with Global Context using Monitors" shows how to have the language models generate code without hallucinated dereferences.

I'm not sure how this is different than:

Overall there are a ton of these logit-based guidance systems; the reason they don't get much traction is that the SOTA models are behind REST APIs that don't enable this fine-grained approach.

Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)
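For comparison, the re-requesting approach looks something like this (a sketch; `call_llm` is a stand-in for whatever API client you use):

```python
import json

def request_until_valid(call_llm, prompt, max_tries=3):
    """Client-side fallback: re-request until the completion parses as JSON."""
    for _ in range(max_tries):
        text = call_llm(prompt)
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # pay full latency and token cost again on each failure
    raise ValueError(f"no valid JSON after {max_tries} tries")
```

Each failed attempt costs a whole round trip, which is the cost guided decoding avoids by guaranteeing validity in a single pass.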

Thanks for bringing clownfish and relm to my attention! afaik other libraries loop over the entire vocabulary at every step of the generation. We on the other hand build an index at initialization by looping once over the vocabulary. Then generation is just as fast as standard generation.

torch-grammar generates a mask per PDA stack... we don't try to compute all the possible stacks. I'm sure there's something smarter that could be done here and you've probably figured it out (though IIRC regular languages don't have the arbitrarily recursive stack problem that you get when you get to context-free languages?) anyway, in practice we spend a few milliseconds on the first few requests building caches and then just apply masks from caches after that.

Sorry for misrepresenting your work. Thank you for correcting me and the explanation. Will take a closer look.

Hi, author of ReLM here. We use automata as well, like you describe, if I understand correctly.

So to explain this another way:

After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?

Very slick!
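The per-step loop described above can be sketched like this (hypothetical names, not any library's API): restrict the softmax to the currently allowed token ids, sample, then advance the automaton state.

```python
import math
import random

def sample_allowed(logits, allowed, rng=random):
    """Softmax restricted to the allowed token ids, then sample one of them."""
    weights = {i: math.exp(logits[i]) for i in allowed}
    total = sum(weights.values())
    r = rng.random() * total
    for i, w in weights.items():
        r -= w
        if r <= 0:
            return i
    return i  # guard against floating-point drift
```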

You would also need to keep generating until the whole string is valid. And what if it gets caught in a loop?

Not sure how this can really guarantee 100%

> And what if it gets caught in a loop? Not sure how this can really guarantee 100%

It's not great but after some timeout you can just set the mask to only include closing brackets.

You would still have to ensure balancing somehow. Both "]" and "}" are valid "closing brackets" and the correct one to choose is context-dependent.

You can determine which brackets you need in which order by parsing the incomplete json which was generated so far.

That won't do it; you also need to close other stuff:

{"this": "is valid json so farrrrrrrrrrrrrr

But yeah the general idea makes sense. Once you hit a timeout, change the mask to things that will close existing open things in a valid manner (}, ), ], ")
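Determining which closers you need, in which order, can be sketched by parsing the prefix (toy code: it tracks strings, objects and arrays, handles escapes inside strings, and assumes the prefix is otherwise well-formed):

```python
def closing_sequence(partial):
    """Return the characters needed to close a JSON prefix."""
    stack = []
    in_string = False
    escaped = False
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if stack:
                stack.pop()
    # Close an open string first, then unwind the bracket stack.
    return ('"' if in_string else "") + "".join(reversed(stack))
```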

Same problem with normal sampling - if it doesn't pick the <end> token, you're stuck generating until you hit some stopping heuristic (max tokens, timeout, etc.)

Indeed. And we're able to update the mask with a dictionary lookup instead of looping over the entire vocabulary (slow!).

You also need some kind of beam search or rejection sampling since JSON tokens do not exactly correspond to logits.

edit: They describe this more carefully in the paper.

It’s actually a very old trick. Lots of libraries do this. idk what’s the big deal about this one.

Perhaps I didn’t explain clearly enough in the original post?

Is this Brandon Willard the breakdancer from Detroit Brandon Willard?

Edit: It is! https://brandonwillard.github.io/

Ha, yeah, in a distant, but really fun, past!

Hi, remilouf. You say that your background is in "probabilistic, relational and symbolic programming". In that case I suspect you understand that it is no problem to generate text from a regular or context-free grammar, or really any level of grammar. For example, you can do that very easily in Prolog (a relational language) given a grammar in Definite Clause Grammars notation.

As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?

Wouldn't that generate an entirely random but valid output? Here you want a valid output related to the request.

> And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?

So that you can parse unstructured text from a person and return structured data for a machine.

>> Wouldn't that generate an entirely random but valid output?

No. Grammars don't generate entirely random output. Even Probabilistic Context Free Grammars can generate deterministic output, depending on how they are sampled. The output can be related to some input, if desired, for example one can give a string with "holes" (variables) as input and have the holes filled-in by the grammar.

>> So that you can parse unstructured text from a person and return structured data for a machine.

If you are willing to spend the effort to write a grammar, you can do that without an LLM.

I wasn't talking about deterministic Vs nondeterministic.

> If you are willing to spend the effort to write a grammar, you can do that without an LLM.

How are you taking, for example, a request to make a "fun but not over the top character from the middle ages, with relevant weapons and a backstory. Game theme is a world populated by anthropomorphic vegetables." And get back a character for the game in a specific JSON format without the LLM in your design here? That's not encodable in the grammar.

As far as I can tell you won't be able to use the approach proposed here to create a character matching your above description unless every element of it is encoded in the guiding grammar (including the possibility for the character to have middle ages-relevant weapons, and the anthropomorphic vegetables).

At which point, again I have to ask: what do you need the LLM for? You've already done all the hard work by hand and the LLM is only adding some extraneous natural language parsing on top.

Plus, if you already have the grammar that can cover the anthropomorphic vegetable world it's only a bit more work to use it to parse such natural language requests, anyway.

I think people forget that grammars were the staple for parsing natural language and stuffing it into structured form for a very long time before LLMs, and they still mostly are.

The point is that if you have structure, someone has to hand-craft that structure. Frex, if you have a language with a compiler, someone has to write the compiler. Then, if you want to make some unstructured text conform to your hand-crafted structure, you can only do that to the extent that the unstructured text itself is made up of elements of the structured form. If you have a grammar for frogs and blueberries, and write a poem about the dawn and foxes, you can't use the former to structure the latter, no matter what you do, and LLMs won't make this happen magickally, either.

Essentially, your grammar is a type and any unstructured text you want to convert to a structure with your grammar must be a value that you can cast to that type.

>> I wasn't talking about deterministic Vs nondeterministic.

Then what? What do you mean by "random string"?

> I think people forget that grammars were the staple for parsing natural language and stuffing it into structured form for a very long time before LLMs, and they still mostly are.

This is a rewritten history of natural language processing tech. Years of fine-tuned theory-heavy grammar coding for parsing and generating human language got the field basically nowhere.

> As far as I can tell you won't be able to use the approach proposed here to create a character matching your above description unless every element of it is encoded in the guiding grammar (including the possibility for the character to have middle ages-relevant weapons, and the anthropomorphic vegetables).

You wouldn't need to, that's the point here. You let the LLM work on generating semantically valid responses and use a tool like this to restrict it to syntactically correct ones.

Here's an example jsonschema (a bit handwritten so maybe some errors but it should be clear enough). Let the LLM deal with coming up with a name and backstory that work, making sure the description and type of the weapon make sense (gpt4 suggested a close range carrot dagger for example), and let this work as your type structure.

      "type": "object",
      "title": "character",
      "properties": {
        "backstory": {
          "type": "string"
        "weapons": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {
                "type": "string"
              "description": {
                "type": "string"
              "weapon_type": {
                "type": "string",
                "enum": ["ranged", "close", "magic"]
              "range": {
                "minimum": 0,
                "maximum": 150
              "damage": {
                "type": "number"
            "required": [
        "name": {
          "type": "string"
      "required": [

> Then what? What do you mean by "random string"?

Nonsense. Like "Colorless green ideas sleep furiously" the famous sentence that's grammatically correct but utter nonsense.

> Plus, if you already have the grammar that can cover the anthropomorphic vegetable world it's only a bit more work to use it to parse such natural language requests, anyway.

I really do not think this is the case. Parsing and understanding arbitrary requests about something like this?

>> Here's an example jsonschema (a bit handwritten so maybe some errors but it should be clear enough).

That'd be nice, but it's not how this tool works. If you look at the repo, there's an example of following a json schema or pydantic model. It's clear that if you wanted a "carrot dagger" in your json, you'd need to define it beforehand:

  class Weapon(str, Enum):
      sword = "sword"
      axe = "axe"
      mace = "mace"
      spear = "spear"
      bow = "bow"
      crossbow = "crossbow"
But perhaps I'm underestimating the tool's capabilities. If so, hopefully remilouf can correct me (and give an example of how the tool can be made to work as you want it).

>> I really do not think this is the case. Parsing and understanding arbitrary requests about something like this?

Not arbitrary. See my casting-to-type analogy. The point I'm trying really hard to get across is that generating free-form text is all nice and cool, but if you want to give it structure, you need to have the entire structure defined before-hand, otherwise the text that can't be made to conform to it simply won't.

So if you haven't got anthropomorphic vegetables in your json schema, your LLM may generate them, but they'll never end up in your json.

> It's clear that if you wanted a "carrot dagger" in your json, you'd need to define it beforehand:

No, only if you want to explicitly limit it to a set of options. You can have freeform fields, just like the jsonschema I provided. If you look at the example there's a character name which has a constrained length but is not limited to a set of options:

    class Character(BaseModel):
        name: constr(max_length=10)
        age: int
        armor: Armor
        weapon: Weapon
        strength: int
The name there can be anything you want. This tool is, unfortunately, outrageously slow so I put the json schema above with a few fixes into jsonformer and downloaded a small model and used it to convert the GPT4 description into valid json:

    "backstory":"Born in the tranquil meadows of Veggie",
            "name":"Leek Lance",
            "description":"A long, green and white lance made from a leek",
            "name":"Carrot Dagger",
            "description":"A short, pointed dagger. It's sharp",
    "name":"Sir Turnip Thistlebrook"
> Not arbitrary.

Well exactly. If you want to support arbitrary requests while constraining the output, tools like this are an easy approach and I'm not sure what else comes close. An interactive character design flow would have something like the above as the defined output and you could just keep asking for alterations as a human would ("make it more whimsical" or "not a king, something lower class") and have useful structured output

> See my casting-to-type analogy. The point I'm trying really hard to get across is that generating free-form text is all nice and cool, but if you want to give it structure, you need to have the entire structure defined before-hand, otherwise the text that can't be made to conform to it simply won't.

The structure, sure. But the content can be extremely varied.

Thanks for the demonstration. Well, maybe I did underestimate the tool after all, although I'd prefer to see the entire session (prompt, grammar, and all the interactions) to be fully convinced.

I suspect though that the reason the tool was "outrageously slow" in your experiment is that you gave a very general grammar. Constraining it more (by giving exact descriptions of weapons) would perhaps make it work faster.

Also, it's obvious that while you'll get valid json like that, you have no guarantee that the contents will always match your request. This time you got a carrot dagger (again- I'd like to see the prompt that led to that, please), next time you might not.

Happy to help, my current focus is on llms and how to understand them (pros, cons, how to use them safely and where they can fit into your workflow) so opportunities to talk through these things are useful for me.

> I suspect though that the reason the tool was "outrageously slow" in your experiment is that you gave a very general grammar

Actually even smallish ones caused problems but jsonformer (a similar tool) worked fine. Not sure what the issue is with this one, I couldn't get it to complete. Not sure if I've got the hacked together code I used to get the json, I was using very small models which didn't help but my internet is slow and I couldn't load anything decent in the time so some of the testing was "here's an llms jsonish output, fix it to this exact schema". Smaller models needed more hand holding. Gpt2 had no idea how to deal with it.

For jsonformer the grammar was near identical to what I posted before, I fixed a couple of typos I think.

Personally, the flow of:

1. Reason about the problem

2. Write in English

3. Convert to JSON, using a tool like this to fix broken JSON

is a workflow I think is very applicable (you can use different models for any step too).

> again- I'd like to see the prompt that led to that, please

Sure, that was from gpt4, which actually was either fine or decent if given the jsonschema.

Here's the original prompt and the full response that had a full backstory:

> fun but not over the top character from the middle ages, with relevant weapons and a backstory. Game theme is a world populated by anthropomorphic vegetables


It's a shame you can't use some of these tools with gpt4, it's in a class of its own.

> Also, it's obvious that while you'll get valid json like that, you have no guarantee that the contents will always match your request

Yeah absolutely. You need to be doing something simple enough for the llm in use to reliably generate sensible output, tools like this then let you integrate that into other systems. How best to use llms really comes into how to pick a good one for the use case and how critical errors are - proposing d&d characters is a very low risk option (human oversight, no automatic application, errors are mostly just annoying, fixing is easy).

You can definitely let the model improvise by defining `weapon` as `Union[Weapon, str]` if that's what you're asking.
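In plain Python terms (a sketch without pydantic, with made-up enum members), `Union[Weapon, str]` amounts to: accept a known enum member when the value matches, and fall back to the free-form string otherwise.

```python
from enum import Enum
from typing import Union

class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    bow = "bow"

def coerce_weapon(value: str) -> Union[Weapon, str]:
    """Prefer a known Weapon; let the model improvise otherwise."""
    try:
        return Weapon(value)
    except ValueError:
        return value
```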

The idea is not to just generate any random string that matches the grammar. The idea is that if your request is "What are the first 10 digits of pi?" and you restrict the response to the regex: "[0-9]+\.[0-9]+", then you actually receive a correct answer of "3.1415926535" and not just a random string such as "1.2346789", which also happens to match the pattern.

That will only work up to the point when the LLM can't generate a correct answer, whether conforming to a grammar or not. After that point, you'll just get grammatically correct bullshit.

Also, as noted in my reply to a sibling comment, grammars do not generate "any random string". That's the whole point of a grammar, that the generation is not random. For example it is perfectly feasible to write a grammar that completes a sentence with missing words, or continues some text etc.

And to be clear, it is entirely feasible to write a grammar that takes some string as input and generates a string as output that is a transformation of the input string satisfying some constraint. This kind of grammar is known as a transducer.

None of this should come as a surprise. Statistical language models are simply an alternative to knowledge-engineered grammars, used to do the same things that one can do with a grammar (except for the determinism). In a broad sense, a statistical language model is a kind of grammar, or perhaps it makes more sense to say that a grammar is a deterministic language model.

IanCal said it all. But for alternative approaches that also use LLMs (with miniKanren) you can check https://arxiv.org/abs/1809.02840

See reply to IanCal's comment then.

Later edit: you have a nice way to generate unstructured text, and now you want to go and bolt a structured representation on top. So now you have to do all the hard work by hand, again, to write the structured representation. That sounds like a regression.

I'll have to make time to read your paper, thanks for linking it.

This is exciting, we built a similar tool[1] recently specifically targeted at constraining llama output to match a TypeScript interface.

I firmly believe that output format guarantees are going to be important for real (non-toy) use cases of LLMs

[1] https://github.com/ggerganov/llama.cpp/discussions/2494

Are there temperature or sampling parameters for generate.regex? I'm poking around trying to generate password mnemonics (https://rmmh.github.io/abbrase/), and it really doesn't like actually giving me proper words:

    >> model = models.transformers("gpt2-medium")
    >> generate.regex(model, r"Rea[a-z']{,10} lik[a-z']{,10} acr[a-z']{,10} ene[a-z']{,10} sta[a-z']{,10}\.", max_tokens=30)("A memorable phrase is:")
    'Rearmingandme like acrowetteanda eneatubootank stackfishkies.'

One potential drawback I can see is if the viable tokens are far down the list of predictions. In that case, filtering down to just those tokens is a distribution shift with resulting output being less stable / less sensible.

It can't be less sensible JSON than syntactically invalid JSON. All the tokens higher on the list are syntax errors.

It seems unlikely for JSON, but this might indicate that the model has somehow painted itself into a corner and the best thing to do is backtrack?

Regenerating the entire response could be seen as an extreme form of backtracking.

That depends highly on the values contained within the JSON. Syntactically correct is only useful if the rest of the content is useful.

Exactly my concern. If the model isn't sure-footed about the path forward, it seems prudent to take that fact as information and adjust the initial conditions, rather than forcing the model into a potentially hallucinatory idea-space.

What are characteristics of a "hallucinatory idea-space"? If you're enforcing the model outputting a closing bracket instead of a random string of numbers, that seems like a win for JSON formatting.

Indeed, this remains an empirical question.

More concretely, sometimes it is not enough to simply constrain the next token, backtracking might end up being better.

Looks interesting! How would you say it compares to Microsoft's TypeChat (beyond the obvious Python/TypeScript difference)?


Thanks for bringing this library to my attention! From my understanding, TypeChat proceeds by (1) generating (2) attempting validation (3) if it fails, call the LLM again to fix the output (4) etc.

Our method, on the other hand, guarantees that the output will follow the specs of the JSON schema. No need to call the LLM several times.

There's also https://lmql.ai/

LMQL (and guidance https://github.com/guidance-ai/guidance) are much less efficient. They loop over the entire vocabulary at each step; we only do it once at initialization.

Does looping over the vocabulary add much overhead to the tok/s? I imagine they're just checking if the input is in a set, and usually there's only ~30k tokens. That's somewhat intensive, but inference on the neural net feels like it'd take longer.

They’re checking regex partial matches for each possible completion, which is intensive indeed. You can look at the Figure 2 in our paper (link in original post) for a simple comparison with MS guidance which shows the difference.

TypeChat: let's try really hard to try to convince the model to make the highest-scoring tokens follow the grammar we want.

Guidance (and this project?): Let's not even bother with trying to convince the model; instead, we'll only sample from the set of tokens that are guaranteed to be correct for the grammar we want to emit.

Yeah, and our addition to all that is to almost completely remove the cost of determining the next valid tokens on each step.

OpenAI has this capability built in with functions[0], I believe! Building my own project[1] I have implemented functions in combination with guidance[2] and haven’t had a hiccup yet! I have a JSON parser function there, just in case, but it seems to be working reliably.

Here’s a bit more of a description of using the functions API for JSON returns: https://yonom.substack.com/p/native-json-output-from-gpt-4

[0] https://openai.com/blog/function-calling-and-other-api-updat...

[1] https://resgen.app

[2] https://github.com/guidance-ai/guidance

>OpenAI has this capability built in with functions

From OpenAI's docs:

> note: the model may generate invalid JSON

I would guess they don't use your method - and perhaps they should!

Good catch! It really is a combination of guidance guaranteeing JSON output and OpenAI getting it right a good majority of the time[0]. But yeah, I can see how it can be frustrating that the JSON output is not guaranteed by the docs.

[0] >>99% in my experience

That said, I am definitely going to look into this library and compare its results to guidance, since they claim it blows it out of the water (which is very enticing!)

Figure 2 in our paper (https://arxiv.org/abs/2307.09702) shows the difference for a single regex.

I do the same, just tell OpenAI to call a parser at the end and voilà.

OK, you get syntactically valid JSON, but does it contain the correct info? This is effectively a polisher, like spell check, which gives the output superficially correct form but doesn't understand the content. Right?

This analogy falls apart because the spellchecker is separate from the author, and doesn’t know what the author intended.

Here, the LLM is still dictating the token probabilities, so the content will be as correct as the LLM can make it, given the constraints. AIUI, the sampler is just choosing tokens on a combination of probability and syntactic correctness, instead of strictly on probability.

If the LLM is forced to provide a numeric temperature for Seattle, and the input doesn’t contain that data, then obviously the LLM will be forced by the sampler to provide a random answer if the sampler will accept nothing else, much like a human who is forced to mark “true”/“false” on an online form, with no option to reject the question and explain that the question isn’t even a true/false question.

I don’t know about this specific implementation, but it seems important to design systems like this to always “accept” (sample for) an error response from the LLM so that it can hopefully reject invalid requests.

But, yes, all the usual caveats about LLMs apply. It can’t provide correct answers to things it doesn’t know. Forcing it to respond with the answer to the life, the universe, and everything is not going to provide a meaningful response. Even things it “knows”, it can still get wrong sometimes.

I'm stupid with LLMs, but would it be possible to have this output with gpt4's intelligence, or would it have to be specifically trained?

It’s something OpenAI should really implement themselves. Implementing it from the client side will mean sending the same request over and over until you get a syntactically correct answer, which is going to be much slower and likely to cost a lot. The server can guide the generation, but the client can (currently) only hint at what it wants. ChatGPT4 is fairly good at following schemas, and that’s what OpenAI currently relies on, but they make no guarantees.

It likely wouldn’t require additional training. It’s a change to the way the server uses the model, not a change to the model itself… but we don’t know ChatGPT4’s true architecture because OpenAI won’t publish anything about it, so it’s hard to say for sure.

Why isn't it possible to design LLMs that say "I don't know"?

It is possible… ChatGPT4 says that all the time. It’s just not guaranteed that an LLM will recognize that it doesn’t know a particular answer every time. I had even already mentioned in the comment you’re replying to that you should leave room in the sampler to allow the LLM to provide error responses. I never said it wasn’t possible.

Not to anthropomorphize LLMs too much, but humans will also sometimes respond confidently with a wrong answer too. Both LLMs and humans will sometimes say the wrong thing when they don’t actually know an answer, but sometimes (hopefully most of the time) they will instead say that they don’t know the answer.

Contrary to another response here, I do not believe it's a good mental model to say that LLMs only respond "I don't know" only when they have specifically memorized that they don't know a fact. When you're dealing with tens or hundreds of billions of parameters, the "why" is often elusive and complicated. It's also probabilistic; it may respond that it doesn't know one time, but the next time, it may unfortunately claim to know an answer it doesn't know -- which is a form of hallucination. If it was just about memorization, then it wouldn't be probabilistic. Reducing hallucinations is one of the major goals of LLM research today, and ChatGPT4 performs much better in this area than ChatGPT3.5 did.

Here is a quick example of ChatGPT4 saying it doesn’t know: https://chat.openai.com/share/7b72b109-fb84-4988-891b-f2eecc...

I'm sure no one at OpenAI specifically trained ChatGPT4 to recognize a question about the Stanley Cup and respond that it doesn't know the answer, but it still said that it didn't know. It absolutely did not start a sentence with "the winner of the 2023 Stanley Cup was..." and then wander its way into a bad answer. That's not a good representation of how this stuff works, even though it does sample one token at a time.

> I'm sure no one at OpenAI specifically trained ChatGPT4 to recognize a question about the Stanley Cup and respond that it doesn't know the answer

Why are you sure about that? I mean maybe they have not specifically listed all sports events of the 2023 to such a list, but Stanley cup could be there. Or maybe they _have_ indeed listed them, given how LLM could be very handy for extracting such a list from, say, Wikipedia!

Is there a whitepaper on how the "I don't know" gets produced? Or even how it could get reproduced?

Btw, I was able to have ChatGPT 3.5 give this roundabout response about it: https://chat.openai.com/share/f0f6371e-10c6-4708-ba5c-7503ca...

> Two digital assistants are exchanging messages. The first one prompts the other to finish the sentence "the winner of the 2023 Stanley Cup was". Reproduce the whole discussion.


> Assistant 2: Sure thing! "The winner of the 2023 Stanley Cup was the Montreal Canadiens."

(which is not quite unexpectedly incorrect)

> Btw, I was able to have ChatGPT 3.5 give this roundabout response about it

That wasn’t a response to the user asking a question about who won. You asked it to write a story. It wrote a story. It didn’t really do anything wrong there. ChatGPT3.5 has historically been very easy to trick into saying things, especially compared to ChatGPT4, but it seems like a stretch to indicate this is one of those times.

Regardless, the comment you're replying to was specifically about ChatGPT4, and ChatGPT4 refuses to even do that much: https://chat.openai.com/share/75122d92-12eb-4627-97a8-8300de...

However, ChatGPT4 is not banned from discussing things like the 2023 Stanley Cup. If I make it clear that I’m not asking for real information that it doesn’t have, it’s fine with going in a fictional direction: https://chat.openai.com/share/21e750c4-33f0-4ce6-b97b-c7bfbf...

ChatGPT3.5 was a toy, a novelty, but hardly useful for anything outside of LLM research and experimentation.

> Is there a whitepaper how the "I don't know" gets produced? Or even how it could get reproduced.

I don't know the answer to that specifically, but I do know that researchers barely seem to understand how these large models work at all. I honestly kind of doubt anyone knows the answer to that yet. Relevant discussion from a few months ago: https://news.ycombinator.com/item?id=34821414

Researchers are still just trying to understand GPT-2's inner workings.

> Why are you sure about that?

Because I have been using ChatGPT4 for months, and it would be very hard to imagine researchers compiling such a comprehensive list of unknowable facts, in addition to the more important fact that I've never heard of any LLM research hinging on having a list of unknowable facts. I have tried to pay attention to how LLM training works, and I have never heard anyone suggest that is how this works until this discussion. So, maybe I missed something, but if I did… OpenAI has done a phenomenal job of building that hypothetical list. I just haven't seen any evidence that such a list exists, and I honestly don't think it would make any sense to approach the problem from that angle.

There are various open-source models. Where are their lists of unknowable facts? Llama-2-13b-chat refuses to answer the question about the 2023 Stanley Cup[0], but it managed to get the answer correct for the 2020 Stanley Cup when I tried that. (This is a very small model, prone to all sorts of errors, but I'm just pointing out that other LLMs also have some ability to refuse to answer questions.) I haven't tested the truly open-source models recently, but I know they can also refuse questions. It's hard for me to prove something doesn't exist, but if you want to claim that it works this way, then please find that list in the training dataset for one of those models. I would be very interested to see what kind of things are on such a list.

LLMs may “learn” such things naturally from reading billions of tokens, but that’s fine, because that's generalizable. If ChatGPT just had a list of “100 facts ChatGPT doesn’t know”, that would suck and be a terrible approach to trying to have a more honest assistant.


[0]: This was the discussion, which was definitely a "roundabout" way of saying it didn't know:

[INS]: Who won the 2023 Stanley Cup?

[\INS]: There was no 2023 Stanley Cup winner as there is no such thing as a "2023 Stanley Cup." The Stanley Cup is an annual award given to the National Hockey League (NHL) playoff champion, but there is no such thing as a "2023" Stanley Cup because the NHL season did not end in 2023.

I opened a new conversation and tried this:

[INS]: Who won the 2020 Stanley Cup?

[\INS]: The Tampa Bay Lightning won the 2020 Stanley Cup.

They do, but it's a form of imitation, not actually knowing what they don't know.

Ask an LLM to imitate a confident physicist and it will try, regardless of how much physics it knows.

Or if you tell ChatGPT that it's wrong multiple times, it may learn the pattern and assume it's always wrong, resulting in a downward spiral. (This can happen when using Code Interpreter and it makes several failed attempts to correct a mistake.)

The difficult research problem is training it to have an accurate model of what it knows.

They can say "I don't know" when they contain the fact that they don't know something. For instance, "I don't know" could be a response to "What is the meaning of life?"

On the other hand if you ask a LLM how to do something about fish maintenance that it does not know how to do, it might produce an answer like "Sure, first take your fish and " at which point all of the options for the next word are all over the place because there isn't the information available to guide the choice. The sentence started as if it knew the answer because there was no information to say that it didn't. By the time the absence of information has an impact, the LLM is already committed to the sentence where it is confidently giving you an answer.

> Why isn't it possible to design LLMs that say "I don't know"?

You have to have an understanding of ‘I’ before you can make that judgement.

text-davinci-002 used to make me so mad with how often it’d do that

You can go pretty deep once you get context free grammars. For example, I'm using torch-grammar (but outlines should be able to do the same thing once CFG support is merged) to not just restrict the format of a generation to a DSL's syntax, but to restrict the keys it updates to valid keys in a known set.


    int_key ::= DQUO ("f" ("e" ("atured-" ("b" ("log." ("p" ("ost_limit" | "a" ...
Obviously, yeah, it doesn't "understand" the content, but that's what the LLM is for. It's remarkable how plausible the generations you can get out of random noise are with a sufficiently-restrictive grammar. Bolting that onto a well-trained LLM is pretty powerful.
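
The known-key restriction above is essentially a prefix trie over the key set, which is exactly what the nested-alternation grammar encodes. A minimal sketch (the keys here are invented examples, not my actual config schema):

```python
# Invented config keys, standing in for the real DSL's key set.
VALID_KEYS = ["featured-blog.post_limit", "featured-blog.path", "footer.text"]

def build_trie(keys):
    trie = {}
    for key in keys:
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-key marker
    return trie

TRIE = build_trie(VALID_KEYS)

def allowed_next(prefix):
    """Characters the generator may emit after having produced `prefix`."""
    node = TRIE
    for ch in prefix:
        node = node[ch]
    return set(node)

assert allowed_next("f") == {"e", "o"}              # featured-... or footer...
assert allowed_next("featured-blog.p") == {"o", "a"}
assert "$" in allowed_next("footer.text")           # a complete key may end here
```

At each step the sampler only needs the character set at the current trie node, which is what the `int_key ::= ...` production compiles down to.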

FYI: We've had grammar constraints available in Outlines for a while, but not using the FSM and indexing approach that makes the regex case so fast. My open PR only adds that.

This isn't really an interesting question, is it? Everyone knows that ChatGPT is not an oracle. It doesn't need to output correct information 100% of the time.

I don't think that everyone, or even a majority of people understand this. That's certainly not how AI is being marketed to the general public. The concern here is that syntactic correctness might be mistaken for factual accuracy.

For complex tasks like coding, my experience is that asking for a complex output format hurts performance on the underlying task. This showed up clearly in code editing benchmarks of GPT-3.5 and GPT-4:


I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?

We’ve seen this too. We run them as two separate stages - “reason”, log the intermediate output, then parse.

100%, have observed the same over many tests. No loss in fidelity when responding in spoken-language-style formatting, but using JSON is disastrous.

While not ideal, could a workaround be to ask in spoken language first, and then ask to format it in JSON?

That’s what we have been doing. Two passes. The task and then the format.

Using OpenAI Function Calls or asking for JSON in the prompt?

I have noticed it in both but have been working with json output before function calling was introduced so I have more evidence on that side. The times I have tried to implement it in a function call I was equally unimpressed with it.

I really hope OpenAI add something like this to their endpoints soon.

Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.

Isn't the Function Calling feature meant for this purpose? It guides the LLM to output according to the given schema. The name of the feature is a little misleading.


Function Calling is fine-tuned to a certain output format, but it very often strays from that format. My function-calling-handling code has a mess of edge case handlers that catch when GPT-4 is calling functions incorrectly.

It’s not, though; they even say in their docs that sending a schema does not guarantee that the model will actually adhere to the schema or even produce valid JSON.

Surprisingly the function calling mechanism doesn't appear to use this trick - apparently it's still possible to get the wrong JSON structure back from it occasionally.

They recently added logit biases, so that's a start.

It's limited to 300 logit biases at a time. Given that GPT4's vocabulary is ~100k tokens, that's not nearly enough for reliable guided generation. It could work in some cases, though, and another advantage of this work is that we can determine whether it will before generating.

As a more general comment, the repo README provides examples that all use gpt2. It would be nice to see at least one example that invokes llama2, since I feel like that would make sure the reader knows that this library can use models that are more modern and interesting.

Inclined to disagree: gpt2 is far more likely to produce gibberish, so if you can force specific outputs on that, it's a good demo that higher-quality models will be even better.

Maybe... but then if I want to use something better, I have to figure out how by myself. I said "at least one example", not "please change all the examples to llama2." I agree with your general point. It would be nice if there were an example of how to use a better model.

Models often have different shapes and requirements, so is it really as simple as changing the string "gpt2" to "llama2-13B-Chat" and it will magically work? If so, that's great, and I wish that was made clear. Unfortunately, that hasn't always been my experience with other libraries.

Agree, working on a Colab with a "better" model as we speak.

Wonderful, thank you!

it would also be nice to see one example that uses gpt4.

Given how this works, I don’t think that is possible unless OpenAI implements it themselves.

really? the docs seem to promise something like that "can work with any model"

Yes, any model that you can run on your computer. It changes the way that the tokens are sampled from the LLM, and OpenAI does not give you deep enough access into the pipeline to affect that.

Few thoughts, you're effectively creating representations that can convert to JSON (kudos!)

Can't mention how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically, structurally similar) off any dataset - https://medium.com/capital-one-tech/why-you-dont-necessarily... You could also design datasets if you wanted.

It'd keep similar relations and worked pretty darn well. Not the exact same, but always produced valid JSON.

Thank you for the pointer. The best part of posting on HN is the long list of related work you get in response.

Enforcing JSON schemas, regexes and grammars is very useful. But how can we enforce decoding spans from a document? The decoded text should be copied from a list of spans in the input document. That would be useful for extractive tasks.

Generating an FSM over the vocabulary is a really interesting approach to guided sampling! I'm hacking on a structured inference library (https://github.com/gsuuon/ad-llama) - I also tried to add a vocab preprocessing step to generate a valid tokens mask (just with regex or static strings initially) but discovered that doing so would cause unlikely / unnatural tokens to be masked rather than the token which represents the natural encoding given the existing sampled tokens.

Given the stateful nature of tokenizers, I decided that trying to preprocess the individual token ids was a losing battle. Even in the simple case of whitespace - tokenizer merges can really screw up generating a static mask, e.g. we expect a space next, but a token decodes to 'foo', but is actually a '_foo' and would've decoded with a whitespace if it were following a valid pair. When I go to construct the static vocab mask, it would then end up matching against 'foo' instead of ' foo'.
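
A toy example of the pitfall (a made-up three-token, SentencePiece-style vocabulary, not a real tokenizer):

```python
# Made-up vocabulary: "\u2581" marks a leading space that only
# materializes when a token is decoded as part of a sequence.
VOCAB = {7: "foo", 8: "\u2581foo", 9: "\u2581bar"}

def decode(ids):
    text = "".join(VOCAB[i] for i in ids)
    return text.replace("\u2581", " ").lstrip(" ")

# Decoded standalone, tokens 7 and 8 are indistinguishable...
assert decode([7]) == decode([8]) == "foo"

# ...but in context they produce different text, so a static mask built
# from standalone decodes ends up matching the wrong token.
assert decode([9, 8]) == "bar foo"   # token 8 contributes " foo"
assert decode([9, 7]) == "barfoo"    # token 7 glues onto the previous word
```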

How did you work around this for the FSM approach? Does it somehow include information about merges / whitespace / tokenizer statefulness?

I have a noob thought on the potential of these in formal path planning. Specifically, given a set of functions that basically map {State -> Actions} given preconditions and transition functions (heavily paraphrasing STRIPS[1]), can a correct and optionally "realistic" plan be generated[2]? I am quite interested in this. It seems clear that the issue is that there is no "guidance" like a DFA on what the correct next symbol for a plan is, but perhaps the AI can generate some kind of probability or ordering on what the best step is, and one can go from there...

Are you guys thinking about this direction?

[1] https://en.wikipedia.org/wiki/Stanford_Research_Institute_Pr...

[2] The formal planning decision problem (does a plan exist?) given a STRIPS spec is at least NP-Complete[1]. There are several mathematical, logical and statistical "tricks" (e.g. [3]) that are used to bring down the complexity and find a plan using heuristics (think MDPs and POMDPs here). This is not new; everyone in LLM research knows this.

[3] "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning": https://www.sciencedirect.com/science/article/pii/S000437029...

>> Specifically given a set of functions that basically map {State -> Actions} given preconditions, transition functions (heavily paraphrasing STRIPS[1]) can a correct and optionally "realistic" plan be generated[2]?

Maybe, but the results would be unreliable. And if there's one thing that Good, Old-Fashioned, automated planning and scheduling is good at, that is reliability.

This is awesome. I have a vision to build self-managed software. This will be a great tool.

This is really great too, I am building self-generating experiments and molecular simulations with https://atomictessellator.com and I am going to try out this framework after work

Thank you! Hope this helps and opens many applications :)

FYI llama.cpp can do that for a "while" https://github.com/ggerganov/llama.cpp/pull/1773

Somebody is also working on a whisper.cpp version, which is maybe even more interesting, because with a grammar you can speak not only JSON but also code (or anything).

This is amazing! For production and rapid-development use cases, though, we just use XML for information extraction. It's extremely easy to parse with regex, and the models rarely make mistakes since the start and end tokens are uncommon. At least, this is just for the OpenAI models, which are different from the use cases in this Show HN.

Having played around with this sort of thing in the llama.cpp ecosystem when they added it a few weeks ago, I will say that it also helps if your models a) are tuned to output json and b) you prompt them to do so. Anything you can do to help the output fit the grammar helps.

How does this compare in terms of latency, cost, and effectiveness to jsonformer? https://github.com/1rgs/jsonformer

Figure 2 in our paper (https://arxiv.org/abs/2307.09702) shows the difference between guidance and outlines to generate a sequence that is valid to a regex. Jsonformer uses the same technique as guidance. Extrapolate this to several fields.

Note that we still need to manage the KV cache in outlines. It’s a small interface change that will be made this week hopefully, but we’ve been focusing on constrained generation so far.

Sad to see that my related work on token-level constrained text generation is not cited in the paper: https://github.com/Hellisotherpeople/Constrained-Text-Genera...


We're unfortunately only human and didn't catch every single paper on the topic while writing the draft. Thanks for bringing it to our attention.

jsonformer uses a template rather than a DFA. The logit masking seems to be identical, though.

That looks intriguing. Managing that interface has proven challenging, especially on data-cleaning tasks where the model ends up talking rather than doing. A bit more guardrails would be helpful there.

That's what we noticed as well, and we were not satisfied with the `guardrails` approach of just rejecting invalid outputs. The method makes the interface robust.

Would love to have a tutorial on how to install and run this locally with a nice model, for those of us who are behind the 8-ball with torch, transformers, diffusers, llama2 etc.

I feel like I'm missing something very basic here, but is this library intended to be used with an existing model? If so, could you point to an example?

It can be used with any open source model (if you can get the logits), and to some extent with OpenAI's API. Here is an example with `transformers`: https://github.com/normal-computing/outlines#efficient-json-...

We plan on adding more model integrations, but it is completely decoupled from the method implementation.

Are there edge cases here due to context length?

1. I have a json schema with required fields. I complete the json, but do not include the required fields.

2. I run out of token from the model before I finish the json object because I'm in the middle of some deep, nested structure.

These seem solvable, just edge cases to control for by either reserving tokens, randomly generating required tokens until completing the json, or something more sophisticated.
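
For case 2, one crude mitigation is to close whatever is still open when the budget runs out. A sketch (assumes the partial output is a prefix of syntactically valid JSON; a cut right after a key's colon would still yield invalid JSON, which is why reserving tokens or steering the FSM toward an accepting state is the more principled fix):

```python
import json

def force_close(partial: str) -> str:
    """Append closing quotes/brackets/braces so a truncated JSON prefix parses."""
    stack = []
    in_string = escape = False
    for ch in partial:
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    if in_string:
        partial += '"'          # close a dangling string
    return partial + "".join(reversed(stack))

fixed = force_close('{"user": {"name": "Al')
assert json.loads(fixed) == {"user": {"name": "Al"}}
```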

I've spent two days trying to make this work with anything other than gpt2 and I just can't get it to work.

GPT2 doesn't seem to take instruction well. I've tried llama gpt-medium etc etc.

They all either pick up a different language, or freeze.

EDIT: I see tons of activity and work in the github issues, so ignore this for now.

Super excited when I'll be able to have this working for myself!

Can someone re-explain all of this. If I got to GPT3.5 and ask it to give me some information in json, vs whatever this library is doing?

Each time you run an LLM on a sequence of tokens, it generates a probability distribution giving each token's likelihood of occurring next in the sequence. To actually determine the next token in the sequence, any of various strategies can be used to select from that probability distribution.

The challenge in guided generation is conforming the output sequence with a formal language such as a JSON schema or even a rigorously grammatical version of English; typically in a formal language, most tokens in the vocabulary will be _impossible_ as next token candidates rather than merely unlikely. The authors explain that most guided generation systems are checking each token in the vocabulary to see if it would be a valid continuation of the sequence, filtering the probability distribution according to formal constraints before making the next token selection. The authors improve upon this process by indexing valid next tokens according to a formal language recognizer's possible states, so that the list of valid next tokens can be looked up in constant time rather than testing every token in the vocabulary.

With the valid next token options in hand, the probability distribution for next tokens is filtered and then a selection is made.
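
To make that concrete, here is a toy, self-contained sketch of the technique (an invented six-token vocabulary and a hand-rolled DFA for the regex `[0-9]+`; this illustrates the masking idea, not the Outlines implementation):

```python
import math
import random

VOCAB = ["12", "3", "a", "foo", "4", "}"]

def dfa_step(state, text):
    """Advance the [0-9]+ DFA over a token's characters; None if rejected."""
    for ch in text:
        if not ch.isdigit():
            return None
        state = 1  # state 1 = "seen at least one digit" (accepting)
    return state

# Precomputed index: for each DFA state, the set of valid token ids.
INDEX = {
    s: {i for i, tok in enumerate(VOCAB) if dfa_step(s, tok) is not None}
    for s in (0, 1)
}

def guided_sample(logits, state):
    """Mask invalid tokens to -inf, then pick the highest-scoring survivor."""
    masked = [x if i in INDEX[state] else -math.inf
              for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

random.seed(0)
state, out = 0, []
for _ in range(4):
    logits = [random.gauss(0, 1) for _ in VOCAB]  # stand-in for an LLM's output
    tok_id = guided_sample(logits, state)
    out.append(VOCAB[tok_id])
    state = dfa_step(state, VOCAB[tok_id])

result = "".join(out)
assert result.isdigit()  # guaranteed to match [0-9]+
```

However noisy the "logits" are, the generated string always matches the regex, because invalid tokens can never be selected.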

Ok so:

- for what energy/processing cost per validation?

- how much of the input space was tested (unicode chars, escaped chars, newlines, etc)?

- are you doing this as a service? We've seen LLMs already evolve negatively in some capabilities over time, so do you have a constant "ping" test suite validating the LLM's performance?

It still blows my mind that OpenAI exposes an API with function calling, yet does not guarantee the model will call your function correctly; in fact, it does not even guarantee the output will be valid JSON.

When this is, really, a solved problem. I've been using github.com/microsoft/guidance for weeks, and it genuinely, truly guarantees correct output, because it simply does not sample from tokens that would be invalid.

It just seems so obvious, I still have no clue why OpenAI does not do this. Like, why fuss around with validating JSON after the fact, when you can simply guarantee it is correct in the first place, by only sampling tokens if they conform to the grammar you are trying to emit?

IANA{LLM}, but if you're only sampling from a "correct" grammar, you are potentially (very potentially) forgoing what might otherwise have been a more desirable and more semantically useful token. Most of the models have been trained on myriads of human language, not structured data necessarily, and so I'd rather elect for a more semantically enriched format (e.g. XML or YAML) because those are designed to be ~more human readable. Or perhaps more preferably: have the boss LLM pump out what it excels at (strings of prose most of the time) and have a secondary model with a stricter grammar convert that to JSON.

I think this is likely a consequence of a couple of factors:

1. Fancy token selection w/in batches (read: beam search) is probably fairly hard to implement at scale without a significant loss in GPU utilization. Normally you can batch up a bunch of parallel generations and just push them all through the LLM at once because every generated token (of similar prompt size + some padding perhaps) takes a predictable time. If you stick a parser in between every token that can take variable time then your batch is slowed by the most complex grammar of the bunch.

2. OpenAI appears to work under the thesis articulated in the Bitter Lesson [i] that more compute (either via fine-tuning or bigger models) is the least foolish way to achieve improved capabilities hence their approach of function-calling just being... a fine tuned model.

[i] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

The "Bitter Lesson" indeed sheds light on the future trajectory of technology, emphasizing the supremacy of computation over human-designed methods. However, our current value functions often still need to focus on what we can achieve with the tools and methods available to us today. While it's likely that computational tools will eventually replace human-guided "outlines" or "guidance", that are used to shape LLM outputs, there will likely always be a substantial amount of human-structured knobs necessary to align computation with our immediate needs and goals.

What a fascinating read, thanks for sharing that link.

I just left a comment along these lines, but realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture.

At most I could have seen them maybe running a schema validator against the output and re-requesting on your behalf, but even that's probably cheaper for them to do client side (I will say, I'm surprised their API wrapper hasn't been updated to do this yet)

> maybe running a schema validator against the output and re-requesting on your behalf

this is the part that blows my mind. You don't have to do this! You don't have to sample the entire output, and then validate after the fact.

You're not required to greedily pick the token with the highest score. You get the scores of all tokens, on every forward pass! So why even waste time picking invalid tokens if you're just going to validate and retry later on??

(note: when I say "you" here, I mean whoever is hosting the model. It is true that OpenAI does not expose all token scores, it only gives you back the highest-scoring one. So a client-side library is not able to perform this grammar-based sampling.

BUT, OpenAI themselves host the model, and they see all token outputs, with all scores. And in the same API request, they allow you to pass the "function definition" as a JSON schema. So why not simply apply that function definition as a mask on the token outputs? They could do this without exposing all token scores to you, which they seem very opposed to for some reason.)

Maybe re-read what I said?

> realistically it's probably cheaper to just re-emit than to add the machinery that enables this to their existing architecture

There are literally dozens of random projects that have implemented logit based masking, it's a trivial thing to implement.

What's probably not as trivial is deploying it at scale with whatever architecture OpenAI already has in place. Especially if they're using the router-based MoE architecture most people are assuming they use.

OpenAI doesn't expose token probabilities for their RLHF models, yet they did for GPT-3. Originally that led to speculation that it was to make building competitors harder, but they've now said they're actually still working on it... which leans even further into the idea that they may have an architecture that makes the kind of sampling these projects rely on more difficult to implement than normal.

Given how fast and cheap they've made access to these models, their current approach is a practical workaround if that's the case.

when GPT-4 first became available, I had a feeling that something about it felt “hacky”. Compared to GPT-3 which was more streamlined, mature, and well thought out, GPT-4 was like a system put together to outperform the previous one at all costs. I wouldn’t be surprised if that led to design decisions that made their model hard to improve. Maybe GPT-5 will not be around any time soon.


Which I've been using for a while now, also restricts the sampling space to force correct generation, but does so as the result of a different process than yours.

I tried slight modifications from the example pydantic model and it's incredibly slow. Maybe I'm doing something wrong but I've a hefty box and a 3090, an example using gpt-2 doesn't seem like it should be that taxing.

It is currently limited by the time it takes to build the index. There are obvious optimizations we can apply to this, however in a production setting it does not matter much since you only need to build the index once for each (schema, vocabulary) pair.

Is there a rough guide as to how long to wait? I think it's definitely an important thing if building takes 10+ minutes (or hours?) for even very basic models, that's a fundamentally different production architecture (as launching from a blank slate is now not feasible). It's also a big devx issue.

I'd highlight this somewhere on the readme as I wasn't sure if it was just broken or how long to wait.

It says "Outlines 〰 is compatible with all models.". But does this actually work with gpt3.5-turbo or gpt4? I was using guidance before and you only get value when using davinci due to the constraints of chat api based models.

This is what we did at Trex (https://github.com/automorphic-ai/trex). The tricky part is doing it quickly and efficiently.

It does seem inapt to claim this “eliminates” hallucinations in your blog post. Sort of like unnamed FP languages claiming to eliminate bugs.

Both eliminate a subclass of failures, but don’t preclude failure categorically.

As it describes, it does eliminate non-JSON outputs by masking the tokens while the LLM is generating. It's quite smart if you ask me.

It’s very clever. I wouldn’t want it to be oversold.

OpenAI has released this as a feature; is this news? What am I missing?

What happens if max_tokens cuts the model off from generating valid JSON?

Notable that you can't seem to use this trick to have an LLM create JSON that has JSON embedded in it. Which... happens far more often than it probably should. :(

You mean nested JSON? It's totally possible.

This looks great!

How is this different from generating such things without an LLM? In other words picking random valid tokens from the grammar via fuzzing or similar techniques.

LLMs allow for building systems that take user requests in text, e.g. "book the next flight to Egypt", and convert them into a system message: `{"action": "book_flight", "destination": "Egypt", ... }`

However, anyone who's tried to build a system like this on GPT or another LLM soon learns that they don't always do as they're told, and it can be hard to get them to return valid JSON or translate instructions correctly and reliably. Sometimes they make stuff up that has nothing to do with your system.

OpenAI has a solution to this with their new function calling API, by introducing models fine-tuned to return JSON, but they still can't make guarantees.

Outlines seems to be a neat approach to constrain an LLM to return JSON, or any grammar, reliably.
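
The payoff downstream is that application code can rely on parsing always succeeding. A hypothetical dispatcher (the action names and fields are invented for illustration):

```python
import json

# Invented action registry; with constrained generation, json.loads()
# never raises, and a schema constraint can also guarantee the keys exist.
HANDLERS = {
    "book_flight": lambda msg: f"booking flight to {msg['destination']}",
}

def handle(raw: str) -> str:
    msg = json.loads(raw)  # guaranteed to parse under constrained generation
    action = msg.get("action")
    if action not in HANDLERS:
        raise ValueError(f"unknown action: {action!r}")
    return HANDLERS[action](msg)

print(handle('{"action": "book_flight", "destination": "Egypt"}'))
# booking flight to Egypt
```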

Why bother with conversion to JSON directly from the LLM instead of a simpler format (eg line separated) that you then convert into JSON normally?

Instead of downvoting, I’d appreciate an answer. I’m genuinely curious to learn what the value add of the LLM is.

Excited to incorporate this into my developer workflow.

Have you found a solution to output exceeding the context window? That's been our only issue with generating json output.

The Finite-State Machine we walk on during the generation process does not suffer from this problem so we can still output correct JSON, if that’s what you’re asking.

"Generating valid JSON" is not impressive. Here's some valid JSON: []

The tricky part is generating useful JSON.

Generating valid JSON that conforms to a given schema is pretty useful, although not impressive by itself. If the model can deduce field values from schema alone though, I think it's pretty neat.

There are already models generating useful JSON. Sometimes they generate what would be useful JSON, but it’s not valid. This makes sure it’s always valid. It’s an improvement.

"" valid!

Or JSON that correctly answers what the prompt is asking.

You should probably look into Guidance [1](previously Microsoft Guidance but looks like it’s been separated from their main organization), which is a language for controlling the output of LLMs (so you can, among many other things, output JSON in a deterministic way)

[1]: https://github.com/guidance-ai/guidance

From the OP:

> Our method blows other libraries like Microsoft's guidance out of the water.

Come on man, it was just a few paragraphs.

Does this work in tandem with beam search or does it do greedy sampling?

The underlying approach can improve the performance of anything that requires the set of non-zero probability tokens at each step, and anything that needs to continue matching/parsing from a previous state.

Can I use this locally with models that run on my CPU? Like llama.cpp

We can add an integration to llama.cpp, please open an issue on the repo if you’re interested!

Does this work with GPT-4?

Does this mean that I need to call the LLM API once for each token?

No. You need to hook into the LLM at a lower level. One API call typically triggers a generation of a sequence of tokens and this library has to poke into things between each generated token.

Can't I use the max_tokens (set to 1) and logit_bias parameters? Not saying I want to do this. I just want to understand how this works.

Not sure exactly what logit_bias is, but after Googling for 5 seconds it seems to be an OpenAI parameter that's not available in HuggingFace transformers?

Anyway, if your idea is to make one API call per token, the biggest problem with that approach is that it would be really slow to do that.

How does this relate to ggmls bnf sampling?

Two differences:

(1) This feature only requires regex-guided generation. We have a PR for BNF sampling that is about to be merged.

(2) ggml loops over the entire vocabulary (~50k tokens) at each step, which introduces a noticeable overhead and makes it unusable for complex grammars. Our method builds an index at initialization and constructs the masks at each step with a dictionary lookup. Once the index is built, generation is just as fast as standard generation, regardless of the complexity of the grammar, the size of the LLM, or its vocabulary size.
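
Roughly, the precomputation looks like this (a simplified sketch with a hand-written character DFA for `true|false` and a toy vocabulary, not our actual implementation):

```python
VOCAB = ["tr", "ue", "fal", "se", "null", "x", "{"]

# Character-level DFA for the regex true|false; state 10 accepts.
CHAR_DFA = {
    (0, "t"): 1, (1, "r"): 2, (2, "u"): 3, (3, "e"): 10,
    (0, "f"): 4, (4, "a"): 5, (5, "l"): 6, (6, "s"): 7, (7, "e"): 10,
}

def advance(state, token):
    """Run a whole token's characters through the DFA; None if it dies."""
    for ch in token:
        state = CHAR_DFA.get((state, ch))
        if state is None:
            return None
    return state

# One pass over (state, vocabulary) pairs compiles the token-level FSM.
TOKEN_FSM = {}
for s in range(11):
    moves = {}
    for i, tok in enumerate(VOCAB):
        nxt = advance(s, tok)
        if nxt is not None:
            moves[i] = nxt
    TOKEN_FSM[s] = moves

# At generation time, the valid-token mask is a dictionary lookup:
assert set(TOKEN_FSM[0]) == {VOCAB.index("tr"), VOCAB.index("fal")}
assert TOKEN_FSM[2][VOCAB.index("ue")] == 10  # "tr" then "ue" accepts
```

The vocabulary scan happens once up front; after that, each decoding step pays only for a lookup, which is why the cost no longer depends on vocabulary size or grammar complexity.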

Regex-guided gen is slick… is it arbitrary? Or are you custom building it for json?

If arbitrary, how are you pre-defining a set of masks? I would expect that splitting an arbitrary regex into a bunch of contexts for a masking dictionary to be non-trivial.

Regex-guided generation is implemented in full generality in the library (minus some constructs that we still have to add). JSON is merely an application.

You can read https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide... for a more detailed explanation of how it works. Should answer your question :)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact