It's crazy how fast this field moves! I basically live on HN, and 30-40% of the terms or metrics here I've never even heard of (or maybe I just glossed over them in the past).
I love articles like these, and how they are able to bring me up to speed (at least to some degree) on the "new paradigm" that is AI/LLM.
As a coder I cannot say what the future will look like (binary views), but I can easily believe that in the future we will have MORE AI/LLM and not LESS, so getting up to speed (at least on the acronyms and core theory and concepts) is well worthwhile.
I think the acronyms and names for things are in constant flux and might not be agreed upon by any large subset of people working in the field. For example, I've built a system employing what the author calls "guardrails" and I've never heard the term. And obviously I've been using RAG, but I've been calling it in-context learning; I never saw the need to emphasise that the context is retrieved from somewhere, as that seems a bit obvious. And I've been looking at how AutoGPT works for inspiration, and they call their evals "challenges", so that's how I approached that problem.
Tbf, if you work as a research scientist, a solid part of your work is to read new papers every single day. The ML/AI field is not some unified field, so you will see A LOT of papers from all kinds of researchers and scientists, from all kinds of fields in the world of science.
Sometimes the terms and metrics are borrowed from the paper author's domain, other times they're just coined on the fly - if there are no good analogies, or there are too many overlapping or closely related terms. It's like in math, where you have a double-digit number of notations for the dot product - all depending on what sub-field of math you're working in.
I understand WHY it is that way, but it's super frustrating - because you end up spending time to look up what the notation / terms mean, often with no real clear answer.
I just skimmed the article, but to me it just felt like they were using new terms for concepts we've already had for a long time in regular software engineering.
That's when you know it's going to be amazing. This is the single best narrative-form overview I've read so far of the current state of integrating LLMs into applications and the challenges encountered. This is fantastic and must have required an incredible amount of work. Massive kudos to the author.
I'm starting to see a lot of products in "beta" that seem to be little more than a very thin wrapper around ChatGPT. So thin that it is trivial to get it to give general responses.
I recently trialed an AI Therapy Assistant service. If I stayed on topic, then it stayed on topic. If I asked it to generate poems or code samples, it happily did that too.
It felt like they rushed it out without even considering that someone might ask it non-therapy related questions.
I’m happy to believe that the therapy was ineffective, but I don’t necessarily understand why going off topic is bad. In my experience, I had a lot of conversations with therapists that were ‘off topic.’
I’ve definitely talked poetry and writing with a therapist, and while I’ve never had my therapist provide code, we’ve definitely talked tech in great detail.
Maybe those therapists were intentionally making me comfortable by engaging with shared interests. And the LLM isn’t being intentional about it, but I’m not convinced that a therapist is ineffective if they fail to stay ‘on topic’ when directed off topic by their patient.
To extrapolate from my own company and the orders we got from the suits, it basically boils down to them saying "Been hearing about this fancy Chat AI thing, can you whip up something like that quick so we can put out a press release saying $COMPANY is doing AI as well?".
Most corpos couldn't give a rat's ass about it, it's just the fancy new toy on the block that's saturating everyone's newsfeeds so they have to jump on it lest they be left in the dust by the competition who are doing the exact same shit, aka calling the "Open"AI APIs and pretending they're doing something groundbreaking.
We got interrupted mid-sprint, mid-epic to make some shitty wrapper around their APIs. I suspect the overwhelming majority of companies with fancy new "AI" features are doing the exact same shit.
There's no way to avoid that except training your own model. It will, likely, always be possible to jailbreak chatbots, or just steer the conversation off course. That's why you must never give them direct access to anything.
I don't see the purpose of this. You will add an additional prompt (cost + latency) just to check if the user is on topic. Why? Why do we need to prevent the users of the therapy bot from generating poems or code samples? Shouldn't we rather spend our efforts optimizing the intended use case?
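For context, the pattern being questioned here is literally one extra classification call per message, something like this sketch (pre-1.0 openai client style; the prompt, model, and policy are all made up for illustration):

    # One extra LLM round-trip per user message, purely to classify topic.
    # Pre-1.0 openai client style; prompt and model choice are made up.
    import openai

    def is_on_topic(user_message: str) -> bool:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer YES or NO only: is this message about "
                            "therapy or mental health?"},
                {"role": "user", "content": user_message},
            ],
            max_tokens=1,
            temperature=0,
        )
        answer = resp["choices"][0]["message"]["content"]
        return answer.strip().upper().startswith("Y")

Every message pays that round-trip before the "real" therapy prompt even runs, which is exactly the cost + latency trade-off being questioned.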
Evals are not suitable for evaluating LLM applications such as RAG, etc., because one has to evaluate on one's own data, where no golden test data exists, and the techniques used have poor correlation with human judgement.
We have built the RAGAS framework for this: https://github.com/explodinggradients/ragas
Great project! We're building an open-source platform for building robust LLM apps (https://github.com/Agenta-AI/agenta), we'd love to integrate your library into our evaluation!
For those who don't have 65 minutes: if you write software, you are probably familiar with the concepts of evals, caching, guardrails, defensive UX, and collecting user feedback, none of which are really unique to LLMs. The other two items are "fine-tuning", which just means nudging the LLM to be better at responding a certain way, and "RAG", which is a new acronym that just means using the input to look things up in a database first and concatenating the results into the prompt so the LLM uses them as part of the context for token generation.
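To underline how simple RAG really is, the whole pipeline fits in a few lines. A rough sketch, where embed(), the vector store, and the llm() call are all placeholders for whatever stack you actually use:

    # Minimal RAG sketch: embed(), store, and llm() are placeholders for
    # whatever embedding model, vector store, and LLM API you actually use.
    def answer(question, store, llm, k=3):
        # 1. Embed the question and fetch the k nearest document chunks.
        docs = store.search(embed(question), k=k)
        # 2. Concatenate the retrieved text into the prompt as context.
        context = "\n\n".join(d.text for d in docs)
        prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        # 3. Generate conditioned on that retrieved context.
        return llm(prompt)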
Good note on design patterns for an LLM-based product. The biggest focal point will be whether we see frameworks evolve that tackle the hard parts here.
Evals, RAG, and guardrails often require recursive calls to LLMs, or to other fine-tuned systems that are themselves based on LLMs.
I would like to see LLMs and models condensed and bundled into more single-task-trained models; that would be much more beneficial than elaborate system design around using general LLMs in applications.
This seems like we are applying traditional system design patterns to using LLMs in practice in apps.
This is fantastic! I found myself nodding along in many places. I've definitely found in practice that evals are critical to shipping LLM-based apps with confidence. I'm actually working on an open-source tool in this space: https://github.com/openpipe/openpipe. Would love any feedback on ways to make it more useful. :)
Basically, get the BM25 results and normalize them to be between 0 and 1, then take the (potentially weighted) average of them and the cosine similarity results (already between 0 and 1) to get the final ranking.
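In code, that combination is just a few lines. A small sketch (alpha is whatever weight you pick, and it takes the parent's word that the cosine scores are already in [0, 1]):

    import numpy as np

    def hybrid_scores(bm25, cosine, alpha=0.5):
        # Min-max normalize BM25 into [0, 1]; the cosine scores are assumed
        # to already be in [0, 1] as described above.
        span = bm25.max() - bm25.min()
        bm25_norm = (bm25 - bm25.min()) / span if span > 0 else np.zeros_like(bm25)
        # Weighted average; alpha = 1.0 is pure keyword, 0.0 is pure vector.
        return alpha * bm25_norm + (1 - alpha) * cosine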
This is going to be only marginally helpful as I don't have references, but I think I implemented this in Elasticsearch.
You can do approximate KNN search with ES by adding a setting on the index that enables KNN and then creating mappings for your embedding objects that define their vector length. Then index your data as you normally would, plus embeddings.
Once you have those in place, you can construct your query and include the embedding similarity in how the query gets scored. When a query is submitted, you embed it and pass both the embedding and the original query text into your ES query. ES will combine all of these elements to score the results.
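From memory, on an 8.x cluster it looks roughly like this with the Python client. The index name, dims, and embed() helper are placeholders, and the exact kNN parameters may differ by ES version:

    # Hybrid keyword + vector query against Elasticsearch 8.x (Python client).
    # Field names, dims, and embed() are placeholders; check your ES version
    # for the exact kNN parameters.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index setup: an indexed dense_vector field enables approximate kNN.
    es.indices.create(index="docs", mappings={
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 384,
                          "index": True, "similarity": "cosine"},
        }
    })

    # Query time: embed the query, then pass both the kNN clause and the
    # keyword query; ES combines their scores into the final ranking.
    q = "how do refunds work"
    resp = es.search(
        index="docs",
        knn={"field": "embedding", "query_vector": embed(q),
             "k": 10, "num_candidates": 100},
        query={"match": {"text": q}},
    )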
TL;DR - doing hybrid vector + keyword search provides more relevant results for text searches than vector search alone. And using sparse vector embeddings for the “keyword” part provides even more relevant results than using BM25.
I've been working on getting LLM-based features out in a production environment for the past few months. This article is absolute gold. Does a great job of capturing several learnings that I think a lot of us are dealing with in silos.
Most of these products are just trivial wrappers around the behemoths, wrappers whose creators either can't recognize or don't even use half the patterns rattled off here.
I'd be more interested in the sales and marketing patterns being employed to hawk the same rebranded wrappers over and over. Ultimately, that's what's really going to contribute most to the success of all these startups.
You'd either need access to the model weights or a fine-tuning API.
Then, depending on which fine-tuning approach you want to use, the user data you need to collect will differ: RLHF requires multiple outputs for a single query, whereas instruction fine-tuning needs great input-output pairs to train on. You could ask for the user's feedback after running the LLM to pick out good training data.
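For the instruction fine-tuning case, the collection side can be as simple as this sketch (names are illustrative: log every interaction plus the user's thumbs-up/down, then export only the approved pairs):

    import json

    # Illustrative names: append every interaction plus the user's verdict,
    # then keep only approved pairs as instruction fine-tuning data.
    def log_interaction(path, prompt, completion, thumbs_up):
        with open(path, "a") as f:
            f.write(json.dumps({"prompt": prompt, "completion": completion,
                                "thumbs_up": thumbs_up}) + "\n")

    def export_training_pairs(path):
        with open(path) as f:
            records = [json.loads(line) for line in f]
        return [{"prompt": r["prompt"], "completion": r["completion"]}
                for r in records if r["thumbs_up"]]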
I'm sorry but from a _practical_ standpoint, it feels like mostly fluff. Someone was advertising today on a HN hiring post that they would create a basic chatbot for a specific set of documents for $15,000. This feels like the type of web page that person would use to confuse a client into thinking that was a fair price.
Practically speaking, the starting point should be the APIs, such as OpenAI's, or open-source frameworks and software. For example, llama_index (https://github.com/jerryjliu/llama_index). You can use something like that, or another GitHub repo built with it, to create a customized chatbot application in a few minutes or a few days. (It should not take two weeks and $15,000.)
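For reference, the whole "chat with your documents" happy path in llama_index is roughly this (the API as of mid-2023 releases; imports have moved around between versions, and it assumes an OpenAI key in your environment):

    # Sketch with llama_index (~0.6-0.8 era API; imports have moved around
    # between releases). Assumes OPENAI_API_KEY is set in the environment.
    from llama_index import VectorStoreIndex, SimpleDirectoryReader

    documents = SimpleDirectoryReader("./docs").load_data()  # your files
    index = VectorStoreIndex.from_documents(documents)       # chunks + embeds
    query_engine = index.as_query_engine()
    print(query_engine.query("What does the contract say about renewals?"))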
It would be good to see something detailed that demonstrates an actual use case for fine-tuning. Also, I don't believe that the academic tests are appropriate in that case. If you really were dead set on avoiding a leading-edge closed LLM and doing actual fine-tuning, you would want a person to look at the outputs and judge them in their specific context, such as handling customer support requests for that system.
What are you even talking about, why would anyone be building chatbots at this point? Chatbots are like the hello world of using an LLM API, it has nothing to do with what this article is about.
Almost every LLM application has an element of conversing with or querying an AI based on some knowledge set or tasks, etc.
100%, this web page (or similar) will be used to basically scam clients into overpaying for simple wrappers around llama_index or LangChain, etc. Some people will spend a week wasting their time trying to fine-tune an open-source LLM on some wholly inadequate dataset before realizing they can use OpenAI and something from GitHub. But most will not admit that.
Sure, a few people doing basically research projects for a large company or university will find some of the information useful. But realistically, probably not so much if they have to actually deliver a working system in a reasonable amount of time that would justify the business expense.
No... this information is for software engineers applying LLMs in their applications. I'm working on an LLM-based system, and I'm just soloing a startup with no academic background; it's certainly no research project. I've been doing it for 3 months, 1-2 hours per night, and I've already applied half the patterns in this article.
I'll grant you that the way the author presents his ideas seems a bit academic, but I assure you all of this information is just the immediate stuff you run into as a software engineer trying to integrate LLMs into your systems. No one is jumping to fine-tuning before they've even tried GPT-4 on their problem.
1) Dropbox can be replicated in a couple of hours with SFTP (or whatever that iconic HN comment was)
2) the devil is in the details. How do you get the data out of documents? Are they PDFs? Do they have tables? Do they have images? Sure, creating embeddings from text is simple, and shoving that into a prompt to get an answer is easy, but getting that text out of different documents can be tricky (see the sketch below).
Source: finished a chat to your documents project a month ago
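To make point 2 concrete, the naive extraction path is something like this (pypdf, with a placeholder filename), and it falls over exactly where you'd expect: tables lose their structure and scanned pages come back empty without OCR:

    # Naive extraction with pypdf; placeholder filename. Tables come out as
    # jumbled text, and scanned/image pages yield nothing without OCR.
    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)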
I guess my point was really not to emphasize that chat-with-documents is particularly easy for every application, but rather just to suggest that the article wasn't particularly practical advice for common use cases.
Oh God, the marketing of barely-understood tech-crafting recipes into new corporate jargon has turned into new acronyms to obfuscate the jargon, and is now accelerating even faster than AI.
Did you miss the NFT train? Have you ever asked yourself if this is what you should be doing with your life?
Just speaking as a guy who actually writes logic and code, rather than like, coming up with incantations and selling horseshit.
We are at the moment in time when LLMs are going from academia to engineering. It's pretty exciting. This is what the process looks like – people reading research papers, doing a bit of practical no-nonsense engineering, and sharing learnings with other engineers.
This will be on YouTube as part of various "Top 5 tips to run AI in your app" videos in about 2 to 3 months.
And yes right now the challenge is figuring out how to productize these LLM things. The researchers are off figuring out what comes after LLMs, we're over here figuring out what to do with these things and how.
Why should I get over it? These people pretending to "research" and "engineer" are just paving the way for total control for a few owners of AI models, who have an unattainable amount of server power.
If you treat coding like learning the cheats of a videogame, you're not a coder, you're not a hacker, you're just a gamer and a fanboi. A consumer of whatever you're given.
Very good article!