This sounds quite cool; I am very picky, but this seems quite clean.
It wasn't clear whether this was for fine-tuning or not; the general explanation could perhaps be clearer. The first few sentences of the README still don't make it very clear (I can sort of guess, having seen DSPy, AutoPrompt, etc.).
It would be awesome if this DID also explore what it needs to do: prompt tuning, fine-tuning, soft-prompt tuning, etc. I am on the lookout for a tool that does this. Obviously a general open-source Q*-like solution would be amazing, but I get that might be a bit of a different beast! Part of my issue is that there are so many things that can be tweaked, and I often don't know the most time- and cost-efficient thing to optimise. I get that prompt tuning is often going to be the best thing to do, especially first. But for efficient inference, shorter prompts may well be needed. Maybe clever model key-value caching is starting to make this less of an issue, but it's still faster to have prompts as short as possible. And sometimes fine-tuning, or even resuming pretraining, may still be the best thing to do.
BTW I would strip `gpt-3.5-turbo` from all the examples, as it's more expensive than the better `gpt-4o-mini`.
I hope to check this out more later.
Nice work!
Thanks for the insightful response. Good point on using 4o-mini to save cost. I'll try it out.
I will look more into soft-prompt tuning.
For the current scope, we are focused on in-context learning: ways to improve model reasoning at inference time.
We use an auto-differentiation framework (backpropagation) to do both zero-shot instruction optimization and few-shot demonstration optimization. Currently, even just zero-shot can often surpass DSPy's few-shot performance (with as many as 40 shots). And I have come up with a training paradigm that will (1) start zero-shot, (2) review performance from an advanced teacher model to see if there is a gap we can gain from the teacher, and (3) if there is a gap to the teacher, start with low-shot demonstrations and gradually increase the number of shots.
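To make that paradigm concrete, here is a minimal sketch in plain Python. The injected helpers (`optimize_zero_shot`, `evaluate`, `add_teacher_demos`) are hypothetical stand-ins, not AdalFlow APIs:

```python
# Minimal sketch of the zero-shot -> few-shot escalation paradigm.
# The injected helper functions are hypothetical stand-ins, not AdalFlow APIs.
def train(task, optimize_zero_shot, evaluate, add_teacher_demos,
          student="gpt-4o-mini", teacher="gpt-4o", max_shots=8):
    # (1) Start with zero-shot instruction optimization on the student.
    prompt = optimize_zero_shot(task, model=student)
    student_score = evaluate(task, model=student, prompt=prompt)

    # (2) Run the same prompt on an advanced teacher model to see
    #     whether there is a gap worth closing with demonstrations.
    teacher_score = evaluate(task, model=teacher, prompt=prompt)
    if teacher_score <= student_score:
        return prompt  # no gap to the teacher: zero-shot is enough

    # (3) Add teacher demonstrations low-shot first, increasing the shot
    #     count until the student catches up or we hit the budget.
    shots = 1
    while shots <= max_shots and student_score < teacher_score:
        prompt = add_teacher_demos(prompt, task, teacher=teacher, n_shots=shots)
        student_score = evaluate(task, model=student, prompt=prompt)
        shots *= 2
    return prompt
```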
LLM applications are messy, but AdalFlow makes them elegant!
The 0.2.0 release highlights a unified auto-differentiation framework where you can perform both instruction and few-shot optimization. Along with our own research, “Learn-to-Reason Few-shot In-context Learning” and “Text-Grad 2.0”, the AdalFlow optimizer converges faster, is more token-efficient, and reaches better accuracy than optimization-focused frameworks like DSPy and text-grad.
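As I understand the unified framework, both the instruction and the demonstrations are declared as trainable parameters on the same task, so one backward pass can improve either. A minimal sketch, assuming the `Parameter` and `ParameterType` names from the docs (exact import paths and signatures may differ):

```python
import adalflow as adal
from adalflow.optim.types import ParameterType  # import path assumed from the docs

# Both the instruction and the demos are trainable Parameters, so the same
# textual-gradient backward pass can optimize either one (or both).
system_prompt = adal.Parameter(
    data="You are a careful reasoner. Answer step by step.",
    role_desc="system instruction to the LLM",
    requires_opt=True,
    param_type=ParameterType.PROMPT,  # instruction optimization
)
few_shot_demos = adal.Parameter(
    data=None,  # starts empty; the optimizer fills in demonstrations
    role_desc="few-shot demonstrations",
    requires_opt=True,
    param_type=ParameterType.DEMOS,   # few-shot optimization
)
```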
Is AdalFlow also focused on automated prompt optimization or is it broader in scope? It looks like there are also some features around evaluation. I'd be really interested to see a comparison between AdalFlow, DSPy [0], LangChain [1] and magentic [2] (a package I've created, narrower in scope).
We are broader. We have the essential building blocks for RAG and agents, and we make whatever you build auto-optimizable. You can think of us as the library for in-context learning, just like PyTorch is the library for model training.
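To illustrate the PyTorch analogy, a component-style sketch (template and field names are illustrative; check the docs for the exact `Generator` signature):

```python
import adalflow as adal
from adalflow.components.model_client import OpenAIClient

# A task pipeline is a Component, much like an nn.Module in PyTorch:
# you compose building blocks, and the trainable parts can later be
# optimized in place. Template and field names here are illustrative.
class QA(adal.Component):
    def __init__(self):
        super().__init__()
        self.generator = adal.Generator(
            model_client=OpenAIClient(),
            model_kwargs={"model": "gpt-4o-mini"},
            template="Answer concisely. Question: {{question}}",
        )

    def call(self, question: str):
        return self.generator(prompt_kwargs={"question": question})
```

Usage would then be along the lines of `QA().call("What is 2 + 2?")`, with the trainer taking over the trainable parts afterwards.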
We have better accuracy, better token efficiency, and faster convergence. We are publishing three research papers to explain this better to researchers.
We will compare with these optimization libraries, but we won't compare with libraries like LangChain or LlamaIndex, as they simply don't have optimization and it is a pain to build on them.
Thanks for the explanation! Do you see auto-optimization as something that is useful for every use case or just some? And what determines when this is useful vs not?
I would say it's useful for all production-grade applications.
`Trainer.diagnose` helps you get a final eval score across different dataset splits (train, val, test), and it logs all errors, including format errors, so that you can manually inspect failures and decide whether the eval score is low enough to need further text-grad optimization.
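For anyone curious what that looks like, a minimal sketch (`MyTaskAdal` and the datasets are placeholders; verify the exact `Trainer.diagnose` signature against the current docs):

```python
import adalflow as adal

# Placeholders: MyTaskAdal is your AdalComponent wrapping the task
# pipeline, and the datasets are your own splits.
trainer = adal.Trainer(adaltask=MyTaskAdal())

# Run diagnosis per split; each run logs every error, including format
# errors, for manual inspection before deciding on text-grad optimization.
trainer.diagnose(dataset=train_dataset, split="train")
trainer.diagnose(dataset=val_dataset, split="val")
trainer.diagnose(dataset=test_dataset, split="test")
```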
If there is still a big gap between your optimized prompt's performance and a more advanced model's performance with the same prompt (say gpt-4o), then you can use our "Learn-to-Reason few-shot" to create demonstrations from the advanced model and further close the gap. We have use cases where we optimized performance all the way from 60% to 94% on gpt-3.5, while gpt-4o gets 98%.
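If it helps to picture it, a demonstration produced this way might carry the teacher's reasoning alongside the answer (purely a hypothetical shape, not AdalFlow's actual data format):

```python
# Hypothetical shape of a "learn-to-reason" demonstration: the teacher
# model's rationale is kept with the answer, so the student sees worked
# reasoning instead of bare input/output pairs.
demo = {
    "question": "A train travels 60 miles in 1.5 hours. What is its speed?",
    "rationale": "Speed = distance / time = 60 / 1.5 = 40 mph.",
    "answer": "40 mph",
}
```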
We will give users some general guidelines.
We are the only library that provides "diagnose" and "debug" features and a clear optimization goal.
AdalFlow is named in honor of Ada Lovelace, the pioneering female mathematician who first recognized that machines could do more than just calculations. As a female-led team, we aim to inspire more women to enter the AI field.