Jsonformer: Generate structured output from LLMs (github.com/1rgs)
340 points by yunyu on May 2, 2023 | 83 comments



I've thought about building this for a while, glad it's out there!

Not only does this guarantee your output is JSON, it lowers your generation cost and latency by filling in many of the repetitive schema tokens without passing them through the LLM.

For the very common case of "extracting multiple structured fields from a piece of unstructured text," I believe there's an even stronger optimization possible that would further decrease costs, latency and potentially even improve accuracy.

Assuming the fields you want to extract are independent (and they often are), you don't need to generate them all in one go autoregressively. Eg. instead of running the following pseudo-prompt:

    "Input: 'It's sunny and cold today'
     Output schema: {"sunny": boolean, "temperature": string}"
You could instead run the following two:

    "Input: 'It's sunny and cold today'
     Output schema: {"sunny": boolean}"

    "Input: 'It's sunny and cold today'
     Output schema: {"temperature": string}"
We don't do that today because when done naively it's very inefficient -- you'd be tokenizing, passing to the GPU, and computing the KV cache of the shared part of the prompt twice. But a library with the right abstraction could run those two queries in a batch in parallel and reuse the same tokenization and KV cache for both of them. It would actually be more efficient than generating both fields in one go, since when you factor out the shared prefixes, both the generated text and its context are shorter!
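
For illustration, here's a rough sketch of what the one-prompt-per-field batching could look like with plain Hugging Face generate() (model name and prompts are placeholders; a vanilla generate() call still recomputes the shared prefix for each row, so the real savings need an inference stack that does prefix caching):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    shared = "Input: 'It's sunny and cold today'\n"
    prompts = [
        shared + 'Output schema: {"sunny": ',
        shared + 'Output schema: {"temperature": "',
    ]

    # One prompt per field, run as a single batch; the shared prefix is identical
    # in every row, which is what a smarter library could cache and reuse.
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    out = model.generate(**batch, max_new_tokens=4, pad_token_id=tokenizer.eos_token_id)
    new_tokens = out[:, batch["input_ids"].shape[1]:]  # drop the (padded) prompts
    print(tokenizer.batch_decode(new_tokens, skip_special_tokens=True))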

I mentioned above that this could also improve accuracy. Of course it doesn't do that by default (except that by excluding all the irrelevant fields it makes self-attention's job easier). But what it does do is give you an independent prompt for each field you're interested in. And so for particularly tricky fields you're trying to extract, you have the flexibility to eg. add several examples to make the generation N-shot.


Maybe this will make CUE popular. It’s similar to JSON, but the ideas of schema and values are put together through unification, or you could say narrowing constraints. CUE would handle taking all of those values individually, then combining them into something concrete, incomplete, or erroring out.


I've already been experimenting. CUE is different enough, but close enough to confuse GPT. I haven't tried fine tuning yet. I should go back and try few-shot with GPT4 now that I have access.

What I've been doing instead is having LLMs generate JSON, putting what jsonformer does in the first prompt for few-shot learning, and then combining it with CUE after, since you can easily intermix data files with CUE.

My latest experiment: https://twitter.com/verdverm/status/1652504163635347456

Creating the prompt for this was pretty interesting and illuminating. While it works for the full text there, you can also do it in parts, only outputting the new parts of the JSON that get merged with the CUE.


Can you briefly describe how you got to the point of having this kind of intuition about language models?


Kind of a meta-answer, but my personal learning style is "think of something cool to build, then figure out what I need to know to build it." It just so happens that a lot of the interesting/cool stuff going on right now builds on top of LLMs, so my projects have naturally gravitated that way. But I've never sat down to take an AI course or read the "Attention Is All You Need" paper or anything. I absorb much more when I learn something because I need it for something I'm working on.


it is a correct answer.


Not op, but I can share my approach - I went line by line through Recmo's Cria: https://github.com/recmo/cria - which is an implementation of Llama in Numpy - so very low level. Took me I think 3-4 days x 10 hours + 1-2 days of reading about Transformers - but from that you can see how models generate text and gain a deep understanding of what's going on.


I can't speak for OP, but something that I think helps is if you think about the generation process as a jumping-off point whose placement one can control, but not really much of what is generated afterwards.

Adding a scheme like this reduces the area of potential off-roading that the LLM can do to a much smaller zone. Additionally, it breaks up the chain of dependencies between the two example outputs, because now we do not need to depend upon past inputs to correctly output this scheme.

Since the information for JSON semantic structure is no longer required to be driven by the LLM (it still has to understand it to be able to generate things with a modicum of sense, IIRC), we can look at our dependency graph for outputs. _This changes because now the fields really and truly are independent (if they are truly informationally independent)._

So now some kind of conjoined information requirement of ( autoregressive output ) <- (( field A ) <- ( field B )) becomes ( autoregressive output ) <- (( field A ) && ( field B )) which then can be factored out into separate calls instead of sequentially, which yields us a batched call of (( autoregressive output A ) <- ( field A ) && ( autoregressive output B ) <- ( field B )).

From there it is just implementation. I likely would not have thought about the OP's way of handling things for a good while, though maybe I would have stumbled into it had I enough reason to think about structured/templated kinds of generation, which I do believe that I do now! <3 :) It really breaks a lot of assumptions that are easy to quietly make and I had not thought appropriately about the consequences of reframing things in this way, to be honest.

As for "how" to think about this, if I were to give my take, it would be always just turning whatever problem in front of you is into a puzzle where you simplify it further each time. Optimizing for less computation, time, code, or even just what all of those are a kind of proxy for: less information to sufficiently solve a problem. We can see that this problem is reduced in complexity appropriately because we remove a redundancy that does not need to be there at all.

One way to look at this is in the relationships between parts of an idea. If you're able to understand, even vaguely, the concepts behind some other concept and how they interact, and maybe even have a 'standard toolkit' of relating to them, you can start finding/transferring/applying other skills to these parts of a concept. I don't think there's a guaranteed-efficient way to maybe reduce a concept down to its parts, or down to a more efficient representation without already, well, knowing that representation. It's an NP-hard problem to me personally, and is the reason why research and other academic pursuits can take a while. It is a good skill to learn I suppose and I certainly enjoy trying to use it, personally.

To tie this back to your question about language models -- yes, some things have to do with the language model, but oftentimes it's actually just the raw mathematical components underlying a model. If you look for that (please please please please please!!!!), then you don't necessarily _have_ to concern yourself with the implementation details (beyond runtime limits, etc); as long as the math still applies you should be able to reason really quite well about what else is happening/could happen with model types like these.

In particular, LLMs being an autoregressive model where each output depends upon its inputs lets us set up a dependency graph. Then, based upon some prior assumptions, we can maybe make some substitutions/changes that allow us to fragment the dependency graph and move it around as we wish. This is not just applicable to LLMs, however; dependency graphs are useful in a wide number of areas.

So one other thing that we're not talking about here is that we're optimizing for an objective we want (clean JSON) by explicitly...well, injecting that objective instead of living on just hopes and dreams, y'aknow. This is a pretty straightforward way of solving the problem by putting the answer in the question, though poor input content still can be a problem.

Stated a different way, we're collapsing the entropy of what the network can introduce, which is perfectly fine in our case if we're exclusively generating JSON, since it gives us exactly what we want. The output should be JSON anyway, but remember [!!!!!!!]: neural networks are noisy estimators, so JSON errors are mathematically guaranteed (even if rare). That means any pipeline depending upon that output, like code, can and will fail, and it's brittle to all sorts of other complicated parsing errors. To catch/detect/enumerate/correct those errors after the fact, we need all of the information needed to implement a JSON structure itself. So basically we'd be using the same exact information, just enforcing it in a horrendously inefficient manner, which is how people have been doing it until the present, and that's okay, as we humans are certainly not NP-optimal machines IMO. The point is that any kind of variance beyond some extremely tiny limit is a problem here, and staying under that limit is not what LLMs are made to do. So at some point it's guaranteed to break, and at high volumes it's basically guaranteed to break in a way that's either unusable or requires so much effort to fix that you might as well have embedded a JSON prior into your generation process in the first place; it would have required the same amount of information as external validation would, albeit with less effort (!!!!). Most methods like this thankfully should have a low level of invasiveness to the model as well, freeing us up to use either the same or a similar model for multiple tasks.

This can create a bit of an ideological illusion, as we technically are destroying information by collapsing the distributions of sentences/strings of tokens/etc that we are generating, and it can maybe lend itself to an "oh, we can add whatever functionality we want!" kind of belief about this kind of modeling. It's important what we're adding and taking away. Also important is part of what makes training these models on next-token prediction over large text corpora so powerful: we can trim them down to some smaller subproblem much, much more easily than we can expand them to cover a larger one. Which is pretty darn cool!

I know this sorta flew around a lot of places and touched on a lot of things, probably not as cogently as I'd want to if I had more time to review and revise it. Hope it was/is helpful for you and feel free to let me know if you have any questions. It's a very cool topic on the whole to me, tbh, and there's a number of interesting conversations that can branch off from this one. Honestly this whole general area is where I see the real value in LLM development in research. It's practical and it's helpful! :D :) <3 :)

Source for experience is a number of years of experience across a wide variety of ML models, though I'm sure I made an embarrassing blunder or two in this post. ;P


You'd need to put the input first for this approach to work, but in my testing models work better if you lead with a question.


Hmm. I admit that I haven't thought about this deeply, but I'm not sure that's true? It seems to me that you could extend the KV cache either backwards or forwards equally easily.


You can’t. The later values depend on the earlier ones, so changing the early tokens invalidates your whole cache.

This is also probably why leading with a question works better in the first place. All later processing conditions on the question in this way.

BTW, in my very limited testing, GPT4 doesn’t care about the order.


I could be reading this wrong, but my assumption is/has been that the prompt goes up to the end of the JSON field name, and the LLM is only filling in the actual value, not the key. I could be wrong on this one, however.


Oh nice! I built a similar system a few weeks ago: https://github.com/newhouseb/clownfish

I think the main differentiating factor here is that this is better if you have a simpler JSON schema without enums or oneOf constraints. If you do have these constraints, i.e. let's say you wanted an array of different types that represented items on a menu { kind: pizza, toppings: [pepperoni] } or { kind: ice_cream, flavor: vanilla | strawberry }, then you would need something more sophisticated like clownfish that can ask the LLM to pick specific properties (and an ability to do some backtracking so you can do proper beam search).
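
To make that concrete, the kind of schema I have in mind looks roughly like this (illustrative only, written as a Python dict): a per-field filler can't just march through the keys, because the model first has to commit to one branch of the oneOf.

    # Illustrative JSON Schema for the menu example; the model has to *choose*
    # a oneOf branch before any field can be filled, which is where
    # backtracking / beam search becomes necessary.
    menu_schema = {
        "type": "array",
        "items": {
            "oneOf": [
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "pizza"},
                        "toppings": {"type": "array", "items": {"type": "string"}},
                    },
                },
                {
                    "type": "object",
                    "properties": {
                        "kind": {"const": "ice_cream"},
                        "flavor": {"enum": ["vanilla", "strawberry"]},
                    },
                },
            ]
        },
    }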

For completeness, another common approach can be found here: https://github.com/ShreyaR/guardrails which essentially boils down to "provide the schema in the prompt and ask the LLM to correct things if it fails to get the schema right the first time."


Another very similar approach to guardrails which manages to avoid XML that I've been using with some success is langchain's OutputFixingParser: https://python.langchain.com/en/latest/modules/prompts/outpu...


I stumbled upon your repository a week ago and I have to say, great work and great ideas!

Another thing I thought about is integrating formatting for fields using a similar system. ISO-8601 dates come immediately to mind, but number and currency formatting are other examples.

Probabilistic enums are another thing I can think of that might be useful for fallback values. I am pretty sure there's a lot of work that can be done in this area, also for other parser kinds.

A related and highly recommended resource is https://github.com/mkuchnik/relm and https://arxiv.org/abs/2211.15458. It is a similar system used to validate LLMs using regexes, though built for completely different use cases. I imagine integrating regex checks on the output fields could also have a lot of use cases.


Thank you! ReLM is a great find! I like that it drives the generation itself so that it can explore different beams more intentionally. And to do the JSON Parsing well against enums/unions/oneOf, you really have to support backtracking in a way that works basically the same as it does for regex so I'm looking forward to digging into their implementation.


One thing that I really like about the approach you took with clownfish is that it doesn't constrain or modify the structure of the prompt.

One of the primary difficulties with writing LLM applications is that prompts are basically not composable, and any LLM library that modifies your prompt is going to be a nightmare to work with.


Follow-up thought I just had: It seems that prompt structure standards are going to have to emerge if any of these tools have a shot at interoperability. I don't have hard data, but IME if a prompt is structured

MEMORY EXAMPLE INSTRUCTION [COMPLETION]

it will basically not work to wrap it in a prompt that's structured

INSTRUCTION MEMORY EXAMPLE [COMPLETION]


Interoperability can also be achieved with small adapters written for the prompting style of the particular model being interfaced with; I'd be surprised if LangChain or AutoGPT don't already do something like this in their systems.

I'm currently building something that leverages an ensemble of different LLMs depending on the difficulty of a task and ran into this issue.

Dolly V2 takes "###Instruction: <your stuff> ###Response" as the structure fed to the model, whereas GPT3.5 Turbo wasn't trained to treat that particular structure as important.

The nice thing is that GPT3.5 Turbo will just roll with the prompt structure Dolly uses, but that only works in very large LLMs; I'd imagine I wouldn't get away with it in other 12B-parameter models.

But realistically this could look like taking the "INSTRUCTION MEMORY EXAMPLE [COMPLETION]" schema represented in a library and having each adapter transform it into the "MEMORY EXAMPLE INSTRUCTION [COMPLETION]" schema, or whatever is needed by the particular model.


I think I agree, and am doing something similar as well. Even the adapters approach does require some amount of consistency across prompts. For example, if you have one prompt that says “you are a very helpful assistant” and one prompt that says “you are a very lazy assistant”, then even if those prompts are otherwise written to be as orthogonal as possible you will still probably see degradation in completion quality.


I haven't spent time going deep here but my current hypothesis is that interoperability will more or less end up looking like toolformer where the "tools" are just separate LLM runs with task-specific context. So for example:

> Instruction: Write a poem and then emit a structure that follows a schema named X.

> Completion: [map-schema X "roses are red, violets are blue"]

Conceptually this is basically just a function call where context is local to the function.


I hate that gpt-3.5-turbo is so cheap that using systems like guardrails is a sane thing to do. I can almost always prompt davinci-003 without guardrails in a way to get my exact schema 1-shot, whereas guardrails + 3.5-turbo will often consume 2-4x more tokens, but that still makes it significantly cheaper.


The problem people are having is hitting the rate limits for ChatGPT.


Is there a rate limit for the paid API [1] too?

[1] https://platform.openai.com/docs/models/gpt-3-5


You will get rate limit errors on all the models if you push hard enough. I always wrap the API with some exponential backoff retries to work around this.
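
Roughly this pattern (sketched against the pre-1.0 openai Python client that was current at the time; adapt the call and exception type to whatever client version you're on):

    import random
    import time

    import openai

    def chat_with_backoff(max_retries=6, **kwargs):
        for attempt in range(max_retries):
            try:
                return openai.ChatCompletion.create(**kwargs)
            except openai.error.RateLimitError:
                # Back off 2^attempt seconds plus jitter, then retry.
                time.sleep(2 ** attempt + random.random())
        raise RuntimeError("still rate limited after retries")

    resp = chat_with_backoff(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "hello"}],
    )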


Thank you for the really clear and complete description of "ControLogit"s and your approach in clownfish!


Thank you for your README, I'm sharing it with my team.


> Bulletproof JSON generation: Jsonformer ensures that the generated JSON is always syntactically correct and conforms to the specified schema.

This is an important definition to take note of: "bulletproof" doesn't mean that you'll get good or correct data. It only means that it'll be valid JSON and in a particular schema that you specify (because the LLM isn't building the JSON in the first place, the library is).
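
For concreteness, usage is roughly the following (paraphrasing the README from memory, so the exact API may differ): the schema goes to the library rather than into the prompt, and only the field values come out of the model.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from jsonformer import Jsonformer

    model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
    tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "number"},
            "is_student": {"type": "boolean"},
        },
    }

    jsonformer = Jsonformer(model, tokenizer, schema, "Generate a person's details:")
    print(jsonformer())  # a dict whose shape matches the schema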

It's an interesting idea. But it's not clear if they've validated the heuristics they use, to see how well it performs in terms of accuracy against, say, some kind of BeautifulSoup-like attempt to make sense of the JSON-ish that the LLM produces and correct that to be valid JSON, or any other approach to the problem.


I wonder if LLMs are at the point where reprompting the LLM with an error message very similar to what a user-friendly JSON processing tool would show a human would usually be a good way to fix errors.


Sometimes, but it very much depends on the context (no pun intended). If it's a pure syntax issue, OpenAI models will almost certainly make the right correction. If it's more abstract, like the LLM has hallucinated a property that is invalid as part of some larger schema, you can quickly descend into the LLM gaslighting you, saying that it has fixed things when it hasn't.


Yeah, but that could require multiple queries, which isn't very efficient. Training a model just to fix JSON would be better.


Love to see further work on constrained decoding like this and other systems introduced in the comments!

See my work and the paper about it. I've got a lot of y'all beat on this (constrained decoding, not the templating and structuring) by about a year:

https://github.com/hellisotherpeople/constrained-text-genera...


Seen a lot of things trying to do this by pressure testing the outputs, but all feel like anti-patterns. This is the first that seems like the "right" way to do it. Better to manage how the model is generating vs creating one more potentially faulty "glue" layer.


Mathematically it requires less information to impose a certain prior on data in the process of generation than it does to read the data, do error detection and correction according to a prior, and then return it, if I understand correctly.

Something always felt incredibly icky to me about any kind of ad-hoc 'fixer' scripts that were part of a pipeline that was fully controlled by a user.


Can you elaborate about what you mean by pressure testing? Haven't heard this term yet.


Maybe not the right term... Just that a lot of other libs act like guardrails, i.e. let the model generate what it does (in full form text / GPT output), and try to then parse out what you want, error if output doesn't conform to standard format. As opposed to basically only allowing the model to generate into the already-built JSON form fields. Understandable why this guardrails/parsing approach is so popular though... can't do what this library is doing with OpenAI API. Need to be able to manipulate the token generation; otherwise you're forced to take full text output and try to parse it.


I found it rather strange that the new Andrew Ng course about prompting, which features an OpenAI employee, says nothing about templated output.

To me this is a killer feature of GPT: being able to turn a document into JSON or any other template.

The kind of prompt is just amazing for GPT (try it with a blog post, document or any other thing): "Analyze this document and transform it into the following format:

<title>

<summary (text conciseness: 5/10)>

<content bullet points (text conciseness 3/10)>

<content_item 1>

<content_item 2>

<content_item N>"

You can also ask for the same output as JSON, and GPT will gladly transform a PDF into JSON.


Templated output means you could build flexible user interfaces to interact in novel ways beyond a mere text input. What I find absolutely incredible right now is that any novel idea I have about "it would be nice if you could do X" is only taking a few days to reach any mainstream tech news source. I used to think the same thing about RubyGem ideas in the late 2000s, and within 6 months a useful package would come out. I put it down to many people consuming the same information at the same time coming up with the same ideas. It's happening much faster this time. 12 months from now who knows what's going to happen!


I knew a similar one called GPTyped, just posted it on HN https://news.ycombinator.com/item?id=35793056#35793057


How about going one step further and constrain transformer output with a context-free grammar? That way you can generate more conformant code such as Python or C.


This may be possible by expressing it as constraints with constrained beam search, which Hugging Face has quietly supported for a long time.


Wouldn't even need beam search if you restrict it to deterministic context free grammars, which would satisfy > 95% of these "generate some JSON schema" use-cases. For DCFGs you can just zero out the probability for any token that is invalid in the context, no lookahead or search needed. Wouldn't work for truly context free things like most programming languages, though.
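
As a rough sketch of the zero-out idea (not code from any of the linked projects), restricted here to the trivial "digits only" grammar for a JSON number field; the model name is a placeholder:

    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Precompute the token ids whose string form is all digits (the grammar-valid set).
    allowed = [i for i in range(len(tokenizer)) if tokenizer.decode([i]).strip().isdigit()]

    class AllowOnly(LogitsProcessor):
        def __init__(self, allowed_ids):
            self.allowed = torch.tensor(allowed_ids)

        def __call__(self, input_ids, scores):
            # Zero out (set to -inf) every token outside the allowed set.
            mask = torch.full_like(scores, float("-inf"))
            mask[:, self.allowed] = 0
            return scores + mask

    prompt = 'The answer as a JSON number: {"value": '
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=3,
        logits_processor=LogitsProcessorList([AllowOnly(allowed)]),
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:]))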


Has anyone seen a tool like this that uses Node rather than Python? I have this exact problem in a GPT-based web application I am building and have had to resort to some “creative” solutions. At the very least I am glad to see people are tackling this problem.


I've had good luck with Langchain's output parsers[0], but in addition to the format instructions I also append something like "Do not provide any explanations, just the JSON output.", which helps eliminate content being generated outside of the JSON block.

[0] https://js.langchain.com/docs/modules/prompts/output_parsers...


I created a toy[0] in Typescript that maps LLM responses to type-safe output.

It uses JSONSchema internally, but I’m thinking of revising it to just use Typescript directly after learning more about the ChatGPT plugin implementation (via their hackathon).

[0]https://github.com/jumploops/magic


Same here. Considering switching my project (or at least part of it) to Python. For anything to do with LLMs or ML in general Python has by far the best libraries. JS is probably second, at least for LLM stuff, but it is a distant second place...


Nice tool, will check it out. I had to go through a painstaking trial and error process to generate valid and deterministic JSON for my AI presentation tool called Slide Genie (https://slidegenie.vercel.app/). The hard part was making it work when temperature > 0.


Nice, this codifies something similar to what I've been doing in my prompts! Will be using this instead.

What I currently have been doing:

The JSON template for your response is provided below. The parts to fill out are capitalized. Please do not modify the template. Please fill in the template with one of the above options for your response. <result> { "rating": "N. RATING", "reason": "REASON" } </result>


I actually did this with a silly little app I made that generates fake social media profiles (https://lookface.app). I gave it a prompt telling it what to generate and an example JSON. As long as you say it must be in JSON, I haven't had any problems with it generating bad JSON.


Nice job - I've tried to massage the outputs to be structured and sometimes it works, but sometimes it fails badly. Having a more specific set of constraints around it will definitely make it more effective.


I've wondered if despite instruction it forgets due to a context limit. There's a lot I still don't understand. I found if it forgot to format in a certain way you could 'remind it', and it would get back on track, but it's an odd way to think of writing software. Kind of like 'turning it off and on again'.


I wanted to see the opposite - parsing JSON and YAML generated from LLMs. It doesn't happen much with GPT-4 but lesser models might mess up the format and then you can't simply parse it.


It sorta feels like LLMs or some kind of NN should be useful (with training) for parsing malformed JSON. I suspect it's a hard problem, but honestly it would be such a massive help for those of us dealing with data at work!


Something like this should be integrated with a library like https://fakerjs.dev/. With LLM-based (or in general AI-based) generation of the fake data, it can be more diverse and generalized for lots more applications and help developers. My bad if I am unaware of faker already having AI-based generation, but afaik it does not right now.


I like the idea of getting ChatGPT to return something easily parse-able by a program. I've been using an XML derivative for that. https://github.com/ColinRyan/Chat-Markup-Language

Never thought to use json schema. I'll check this out!


I might be reading the code wrong but it looks like it crawls the schema making a generation per primitive type. While that’s a clever way to ensure valid JSON, I don’t know if I’d go as far as to describe it as efficient.

That said, if the model is unable to generate JSON due to its training/fine-tuning, this is indeed a clever solution!


It’s clever if you’re running your own models, or only paying for tokens generated.

Since you’re generating the variable fields anyway, it will actually require fewer forward passes even if broken up in multiple prompts than if you generated the static fields as well.

Of course this doesn’t work for OpenAI apis which charge for input context on a per invocation basis.


   Efficiency: By generating only the content tokens and filling in the fixed tokens, Jsonformer is more efficient than generating a full JSON string and parsing it.
I was excited to try this in Replit... and realized it required pytorch. Ouch. Replit was not happy about that!


Is there a way to do something like this but with fine-tuning? For example, I want to train a LoRA to become an email spam classifier. I have the training data, with the prompt as the email and the response as {Boolean: True/False}.


When I ask GPT-4 to give me a response in JSON format it just does it, so I assume that yes you can fine-tune an LLM to handle this.


Very interesting. I've only been using OpenAI APIs so this logit stuff is new to me.


It's a testament to the democratization of ML that practitioners today can get by without knowing what a logit is.


And I am personally glad for that, for one! This means it's accessible to more people without requiring specialized knowledge, and while, yes, I think that always triggers an internal reaction from most of us when it comes to thinking about field dilution, it's almost a necessary tradeoff (like, say, the uncertainty principle) when expanding the field out to more people.

So, hurray! We've made it more accessible. And hopefully in years to come, even very much more so! <3 :)


I'm flabbergasted that OpenAI does not yet offer an API that reliably returns JSON based on some schema you feed it. It's sort of possible to force it to do this but not really to a production-ready degree.


I've complained bitterly and openly about how annoying it is that OpenAI locks down access to the full probability distribution. Glad to see that others are running into this stupid limitation and are doing work related to it.


Trying to understand why this is necessary? LLMs cannot reliably generate valid Jason?


Two ways in which it is useful over existing techniques:

- It is guaranteed to match your schema

- It is much lighter weight


Also, it costs tokens to ask a model to do something, and it may choose not to do it re: constraints

You can force it by banning the vocabulary which violates a constraint for free.
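
For example, with Hugging Face models you can hand generate() a ban list directly (a sketch; the banned strings here are arbitrary stand-ins for whatever violates your constraint):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Token sequences we never want to see in the output (arbitrary examples).
    banned = [tokenizer(w, add_special_tokens=False).input_ids
              for w in ["```", "Sure", "Note"]]

    inputs = tokenizer('{"rating": ', return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, bad_words_ids=banned,
                         pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(out[0]))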


There are typically two issues:

* Occasionally, random junk will get tossed in alongside valid JSON. If you're a human, it's easy to identify and fix. Programmatically, it can be much harder to fix.

* Context lengths can create issues.


Yes, but Mike on the other hand....


It's not very hard through prompting. You can just ask the LLM to generate on these parameters. I did this exact same thing and never wrote any code for it.


It's great that this does not use OpenAI and runs locally.


This is pretty cool. I tried it with Dolly and then with T5-base; neither gave me a result. It broke for me. Has anyone tried it?


How does this work? I guess it's different from something like fine-tuning because that wouldn't 100% guarantee the right schema?


Is it possible to use this with OpenAI’s models? I.e., do they support such in-line token generation?


Fantastic! This makes it easy to let humans write prompts and generate requests an API can consume.


There is no point in constructing a fixed template JSON object like that just to parse it again.


I think I disagree, but your comment lacks context, care to elaborate?


not op, but there's plenty of use cases creating structured JSON on a server then sending it over the wire to the client


I'm trying to imagine what a possible use case would be for this. Any simple examples?


Almost any integration with computer systems would need some structured format, let’s say you want to write an intelligent file manager where you could prompt to “find a file that ends in Final”, and with enough instructions the LLM might give you back something like:

  {
    "command": "find",
    "param": { "type": "regex", "value": "[^.]*Final\.[^.]*" }
  }

which you can actually interpret and execute for the user. But I’m absolutely a novice at anything ML, so take my comments with a grain of salt.


I hope that this is new to no one generating JSON using an LLM, because it felt like the first thing to do when I implemented that kind of stuff. That being said, it's nice to have it as a ready-to-go library.


It may be easier to use with Pydantic.



