Show HN: Structured output from LLMs without reprompting (automorphic.ai)
174 points by sandkoan on July 16, 2023 | 54 comments
Built a tool for transforming unstructured data into structured outputs using language models (with 100% schema adherence).

If you're facing problems getting GPT to adhere to a schema (JSON, XML, etc.) or regex, need to bulk process some unstructured data, or generate synthetic data, check it out.

We run our own tuned model (which you can self-host if you want), so we have incredibly fine-grained control over text generation.

Repository: https://github.com/automorphic-ai/trex

Playground: https://automorphic.ai/playground




The more time goes on, the more I realize that the true power of LLMs is not in the unstructured text they can generate, but in structured output. There are two approaches to achieve this:

1. LMQL/guidance/JSONformer/OP's post

2. finetuning the model to understand function calls and their (potentially JSON) schemas.

There was a comment here about OpenAI's approach (finetuning a model to understand function calls) which raised a good point: since finetuning is often forgetful (previously learnt knowledge gets partially forgotten), it's not clear whether OpenAI's approach has made GPT-4 less capable than it was before. Not to mention that you're still dealing with a statistical process (an LLM), not a locked-in algorithm that generates the desired schema 100% of the time.

Which brings me to the other approach: steering the LLM's output __as it is generating tokens__, which is what LMQL does. This results in lower token usage (you don't send the function schema as part of your prompt/message to OpenAI) and 100% schema adherence, because token probabilities are modified (e.g., 0% chance of any character except ":" after the double quotation mark closing a key).
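
Concretely, "modifying token probabilities" just means masking the logits before sampling. A minimal sketch with HuggingFace transformers (the allowed-token function here is a hypothetical stand-in for a real grammar/regex state machine, not any specific library's implementation):

    # Minimal sketch of constrained decoding: push every token the schema
    # doesn't allow at the current position to probability ~0.
    # allowed_token_ids() is a hypothetical stand-in for a grammar/regex
    # state machine that tracks what may legally come next.
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def allowed_token_ids(generated_ids):
        # e.g. after the closing quote of a JSON key, only '"'-adjacent
        # punctuation like ":" would be legal; hardcoded here for brevity
        return [tokenizer.convert_tokens_to_ids('"')]

    class SchemaConstraint(LogitsProcessor):
        def __call__(self, input_ids, scores):
            mask = torch.full_like(scores, float("-inf"))
            for i, seq in enumerate(input_ids):
                mask[i, allowed_token_ids(seq.tolist())] = 0.0  # keep allowed tokens
            return scores + mask  # everything else becomes probability ~0

    inputs = tokenizer("Produce JSON: ", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20,
                         logits_processor=LogitsProcessorList([SchemaConstraint()]))
    print(tokenizer.decode(out[0]))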


> Which brings me to the other approach: steering the LLM's output __as it is generating tokens__

A relevant PR:

https://github.com/ggerganov/llama.cpp/pull/1773

The plan is to support arbitrary grammar files to constrain token generation, similar to the grammar files here:

https://github.com/antlr/grammars-v4


I mean, even Jsonformer used a LogitsWarper when generating numbers, but yes, arbitrary grammars are infinitely more powerful.


Thank you for bringing up LMQL. We have active branches adding regex, parser, and type support (structured and other types), which will soon be upstreamed, improving typed LLM use beyond the current template-based approach we support.


Yes! I don’t have much to say that wouldn’t be restating what I’ve already written, so I’ll link this for reference: https://www.sebastianmellen.com/post/2023/the-killer-use-cas....


I'd never heard of LMQL before today, but it looks very nice. Do you have any experience building with it and, if so, would you be willing to comment on what it's like?


I did a POC project with it recently. The guidance on the gpt-3.5-turbo and gpt-4 models isn't as functional as on plain gpt-3. I found I had better results using https://github.com/piercefreeman/gpt-json, and it doesn't require multiple calls to the API. Not as feature-filled, but it may meet your needs.


Thanks for the recommendation - gpt-json looks quite nice, actually - will check it out.


Has there been any work done to finetune OSS models to behave the same as OpenAI functions (to constrain JSON output)?

This is table stakes now, but it doesn't seem like ANY open-source model has this capability.


You wouldn't actually want to, because you'd be losing generalizability, and it's a lot of unnecessary work.

I think approach #1 outlined above is the better (more cost- and time-efficient) technique—where a pretrained model already understands JSON (among myriad other formats), and you merely constrain it at text-gen time to valid JSON (or other format).


So I'm not sure what the difference is between what you wrote and what I wrote? Are you distinguishing between "pretrained models" (as in base models) and finetuned models?

Here's my question then: was the GPT 0613 update (which introduced functions) a completely new base model or simply a finetuned model? It seems to be the latter.


Yeah, then it seems we agree. I was just pointing out that it's not necessary to finetune OSS models to behave like OpenAI functions if you're able to do something similar to what we did (no tuning involved!).


I think the references to this being a “tool” that you can “self-host if you want” are a little disingenuous, especially after seeing that the linked GitHub project doesn’t mention the fact that it’s just a thin wrapper client making requests to a remote server until you’re halfway through the README. The product might be great, but introducing it to the community in this way doesn’t foster much trust in your company from the perspective of a potential customer.

The only reference I can find to this being a self-hosted model is a blurb in the GitHub README saying “If you'd like to self-host this in your own cloud, email us”. Sure, I can email my OpenAI/Microsoft rep and self-host GPT-4 in my own cloud for enough money too, but that doesn’t change the fact that the primary business model is SaaS. Just be up-front about this fact in community posts, rather than obfuscating it. Your website does a great job with that.


We're offering the API for free, and have the playground available for anyone to use as well. As for self-hosting, we haven't quite decided on the business model, and were trying to gauge interest. For now, we're happy to offer free self-hosting for the first few people, but may charge eventually.

Our intention wasn't to obfuscate this, so thanks for the feedback. We'll try making that more apparent.


We're using a similar approach with OpenAI. The user can define a schema using zod and call a prompt. We're then using OpenAI functions behind the scenes to parse the answer into the shape the user wants. Add JSON schema validation on top and we can be sure that the response conforms to our Schema. Some more details and examples can be found in this blog post: https://wundergraph.com/blog/beyond_functions_seamlessly_bui...
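
For anyone who wants the gist without reading the post, here is a rough Python analogue of that flow, using pydantic in place of zod (a sketch only; assumes the pre-1.0 openai SDK and pydantic v1 that were current at the time, and the function/field names are illustrative):

    # Sketch: schema -> forced OpenAI function call -> validate the result.
    import openai
    from pydantic import BaseModel

    class Person(BaseModel):
        name: str
        age: int

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "Alice is 30 years old."}],
        functions=[{
            "name": "extract_person",
            "description": "Extract a person from the text",
            "parameters": Person.schema(),        # pydantic model -> JSON schema
        }],
        function_call={"name": "extract_person"},  # force the model to call it
    )

    args = resp["choices"][0]["message"]["function_call"]["arguments"]
    person = Person.parse_raw(args)  # raises ValidationError if the JSON drifts from the schema
    print(person)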


What are the benefits over https://github.com/microsoft/guidance/ ?


We enable conforming to arbitrary context-free grammars in addition to regex patterns, and have a bunch of speed optimizations as well.

Though it may not seem too fast right now on account of the hundreds of simultaneous requests we're getting :)


100s of RPS? Did you have a successful launch elsewhere or something? Because this repo currently only has 21 stars, which, taking the normal correlation into account, does not imply that level of traffic.


We have folks playing around with it mostly through the playground / raw HTTP endpoints as opposed to the Python API. And we've got some batch jobs running, which adds further traffic.


guidance is a dead project. It worked well as a hobby side project by MS researchers, but it clearly isn't a long-term solution as new LLMs are introduced.


Would you be able and willing to provide a bit more supporting evidence for this statement? We’ve been considering trialing guidance for a few weeks but won’t bother if it’s going nowhere.


I am unclear on the status of the project, but here is the conversation that seems to be tracking it: https://github.com/microsoft/guidance/discussions/201


Okay interesting, thanks. There does seem to be some recent activity on the `pythonic` branch but yeah, it does look like the open issues have been going ignored for a while now.


If it helps: I tried using it, and the basic examples just straight up didn't work at all and regularly broke in different ways. Even if it's going somewhere, it was unusable for a serious project. I moved to custom code and left it on the "watch this project" list for the future.


It does help, thank you. We’re going to put it on pause for now, too.


What about LMQL?


I hadn't heard of LMQL before - have you tried building with it and, if so, would you be willing to share your experience?


Looking at the playground, it appears the few shot examples in the prompt and CFG are duplicative. What is the relationship between the two?

When you say in another comment that using OpenAI functions to output JSON is a waste of tokens, how are you generating the JSON output? And why do your prompts then include few shot examples of JSON objects?


The prompt is given to our model as a guiding aid (a suggestion), and the CFG is used to constrain the model to generate only tokens that abide by the schema (an enforcement). That's how we ensure only valid outputs at text generation time.

We also prefill some tokens depending on the set of allowed tokens at a given state, so the model doesn't waste resources trying to predict them.
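
In other words, whenever the grammar state permits exactly one next token, you can append it directly and skip the forward pass entirely. A minimal sketch of that idea (hypothetical helper names, not the actual trex internals):

    # Sketch of constrained decoding with prefilled (forced) tokens.
    # grammar_allowed(), model_step() and sample_from() are hypothetical helpers.
    def generate_constrained(tokens, grammar_allowed, model_step, sample_from, max_new=256):
        for _ in range(max_new):
            allowed = grammar_allowed(tokens)       # token ids valid in this grammar state
            if not allowed:
                break                               # grammar says the output is complete
            if len(allowed) == 1:
                tokens.append(allowed[0])           # forced token: no model call wasted on it
                continue
            logits = model_step(tokens)             # one forward pass
            tokens.append(sample_from(logits, allowed))  # sample only among allowed ids
        return tokens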


When you say “our model”, are you using a custom LLM for completions vs OpenAI or other LLM vendor?


Custom LLM—hence the self-hostability.


What's the difference between self hosting this and manually running https://github.com/r2d4/parserllm ?


Can someone explain what they use structured output for? I’m just curious what kinds of use cases people have found for it.


I use it to create a narrative plan for multiple interleaved plots in an open-world interactive fiction.

For instance: https://pastebin.com/QFZmEAJA

I use Clojure's EDN format (its JSON equivalent), and what you can read in this paste is an attempt to make GPT write its own prompt, in a conversation where I gradually built a format for this narrative structure using Clojure.

It turns out GPT isn't able to produce EDN data using this prompt (it will produce something that looks like the "grammar" it came up with, displayed in the paste above, rather than Clojure data as instructed).

I can get it to output EDN if I provide an example, but then the story in the example tends to leak into the generated story. And it still has problems: missing keys, for instance, or it doesn't use nested subnarratives, or it just fails at outputting strict EDN, forgetting parentheses or adding extra ones.

Here's what the EDN structure I want to get might look like:

https://pastebin.com/iTxtn8gk

And here's what kind of text can be generated from it:

https://pastebin.com/qJPjmrTd

For now I haven't even used the parseable EDN programmatically. I just feed it back to GPT as a string (realistically, I'd need to use a vector database to store these narrative blocks). However GPT will slowly erode the structure with every round-trip.


You mention self hosting the model. Do you have the model weights up on HuggingFace?


This is model agnostic, actually—any model on HuggingFace is compatible. So if someone wanted to run this with their own model, they could.


Ok, the way you wrote it implied the model itself is available to self-host.


Ahh, no, the value of this isn't as much the model as it is the infrastructure enabling structure enforcement.


Could you contextualise this against OpenAI’s native functions?


Problems with OpenAI:

1) You're wasting GPT tokens on outputting JSON instead of meaningful information.

2) GPT functions won't, with absolute, 100% certainty, return JSON in the schema you want. In 1% to 3% of cases it hallucinates fields, etc.

3) This also allows you to output data in arbitrary non-JSON formats.

4) You can't self-host OpenAI functions.


Thanks, all good points that would seem to make this library a good fit for certain use-cases.

As with the other poster, I’d be interested to hear a bit more about point 1.



Got it, thanks. Certainly a very interesting and active space. I was playing around with FLARE (https://arxiv.org/abs/2305.06983) for RAG this week, and LMQL (mentioned by another poster) seems to use a similar technique.


In response to your sister comment: the implementation we used was the naive one from LangChain (https://python.langchain.com/docs/modules/chains/additional/...). We've decomposed that to use as a starting point, but early results are promising, yes, although it doesn't yet seem to be possible to get the necessary `logprobs` out of the GPT-4 API, so we're stuck with 3.5-turbo atm.


Ahh, I've been meaning to try FLARE—was it a marked improvement over traditional RAG?


Point 1 doesn't feel like a good enough reason. The number of tokens output as JSON is so small if you tell GPT to output it properly.


Costs add up surprisingly quickly. A quote-colon-space-quote combo alone is four tokens wasted. Now scale that up....


Using tiktokenizer, these are only two tokens: quote-colon is token 498, space-quote is token 330 (as per https://tiktokenizer.vercel.app/). But I agree with the general argument.
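
Easy to check yourself with tiktoken, for anyone counting:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.encode('": "'))  # two tokens per the tiktokenizer link above (498 and 330)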

I think what factors in even more when you use the API is that you do not have fine-grained control over the generation process. If you follow the MS guidance approach, you fill in the structured text yourself and then let the model generate only the value parts, e.g. up to the next quote. To do that more or less word by word, you have to make multiple API calls, and you have to be very smart about providing the right stop tokens.
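
Roughly, that fill-in-the-values loop over the API looks like this (a sketch with the pre-1.0 openai SDK; the fields and prompts are illustrative, not any library's actual code):

    # Emit the JSON skeleton yourself; ask the model only for each value,
    # stopping at the closing quote. One API call per field.
    import openai

    def fill_value(context, partial_json):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": f"{context}\nContinue this JSON exactly:\n{partial_json}"}],
            stop=['"'],        # cut generation at the value's closing quote
            max_tokens=50,
        )
        return resp["choices"][0]["message"]["content"]

    context = "Alice is 30 and lives in Paris."
    partial = '{"name": "'
    partial += fill_value(context, partial) + '", "city": "'
    partial += fill_value(context, partial) + '"}'
    print(partial)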


Can easily be done with OAI's new function calling.



Don't forget to put a license on your repository.


Thanks for the reminder—done!


Oh, it's 100% predictable alright, predictably garbage: the default example chooses height as the wrong "number", whatever that's assumed to be, and then if you try to change it to define height as perhaps "height in total inches", it still gets it wrong.


Ahh, that—due to compute limitations we're forced to run a very small model that isn't as capable of converting 5'8" to inches. The larger model is, though.



