Built a tool for transforming unstructured data into structured outputs using language models, with 100% schema adherence.
If you're having trouble getting GPT to adhere to a schema (JSON, XML, etc.) or a regex, need to bulk-process unstructured data, or want to generate synthetic data, check it out.
We run our own tuned model (which you can self-host if you want), so we have incredibly fine-grained control over text generation.
Repository: https://github.com/automorphic-ai/trex
Playground: https://automorphic.ai/playground
There are two approaches here:
1. Steering generation as it happens: LMQL/guidance/JSONformer/OP's post.
2. Finetuning the model to understand function calls and their (potential) JSON schemas, which is OpenAI's approach (a minimal sketch follows below).
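For reference, here's roughly what approach 2 looks like from the caller's side, using OpenAI's function-calling API (openai-python 0.x style; the function name and schema here are made up for illustration, and OPENAI_API_KEY is assumed to be set in the environment):

```python
import json
import openai  # openai-python 0.x; reads OPENAI_API_KEY from the environment

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": "Alice is 30 and lives in Paris."}],
    functions=[{
        "name": "record_person",  # hypothetical function, for illustration only
        "description": "Record structured facts about a person.",
        "parameters": {  # a JSON schema the finetuned model was trained to follow
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "city": {"type": "string"},
            },
            "required": ["name", "age"],
        },
    }],
    function_call={"name": "record_person"},  # force the model to "call" this function
)

# The arguments come back as a JSON string. It's usually valid, but not
# guaranteed to be, since the model is still sampling tokens freely.
args = json.loads(response.choices[0].message.function_call.arguments)
print(args)
```

Note that nothing in this flow actually prevents the model from emitting malformed JSON; the finetuning just makes it very likely to be well-formed.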
There was a comment here about OpenAI's approach (finetuning a model to understand function calls) that raised a good point: since finetuning tends to be forgetful (some of the knowledge the model previously learned gets degraded), it's not clear whether OpenAI's approach has made GPT-4 less capable than it was before. Not to mention that you're still dealing with a statistical process (an LLM), not a deterministic algorithm that generates the desired schema 100% of the time.
Which brings me to the other approach: steering the LLM's output __as it is generating tokens__, which is what LMQL does. This results in lower token usage (you don't send the function schema as part of your prompt/message to OpenAI) and 100% accuracy, because the token probabilities themselves are modified (e.g., a 0% chance of any character except ":" after the closing quote of a JSON key).
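For the curious, here's a minimal sketch of that kind of logit masking using Hugging Face transformers' LogitsProcessor hook. The model choice and the toy digits-only constraint are my own illustrative assumptions, not how TRex or LMQL actually implement it (a real implementation would walk a grammar or regex automaton to compute the allowed set at each step):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

class AllowListProcessor(LogitsProcessor):
    """Set every token outside the allow list to probability zero."""
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(allowed_token_ids)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0  # allowed tokens keep their original scores
        return scores + mask

# Toy constraint: only the digit tokens 0-9 may ever be generated.
digit_ids = [tokenizer.encode(str(d))[0] for d in range(10)]

inputs = tokenizer("The answer is ", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=4,
    logits_processor=LogitsProcessorList([AllowListProcessor(digit_ids)]),
)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```

Because disallowed tokens get -inf logits before sampling, the constraint holds on every step by construction, which is where the "100%" guarantee comes from, no matter how the underlying model was trained.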