Lots of companies are building with LLMs, and seemingly infinite startups are showing off flashy demos with a waitlist. Who has actually deployed to production, released to everyone, and is iterating? How are you evaluating these systems over time and making sure you don't regress stuff that's already working?
I'm at a large government agency, partnered with one of the top AI companies in the world. We've been deploying models and transformers, and recently an LLM ensemble, in prod for 3-4 years now. Many lessons learned along the way that open source hadn't really provided utility for.
The biggest backend things: models can run on CPU architecture. You still need highly performant CPUs, but in the cloud you can significantly discount that by relying on spot instances (this doesn't mean GPU isn't viable; we just found a real cost lever that CPU has supported... things may change, and we know how to test that). Further, distributed and parallel (network) processing are important, especially in retrieval-augmented architectures... thus we booted Python long ago, and the lower-level APIs aren't simply serializing standard JSON -- think protobuf land (TensorFlow Serving offers inspiration). BTW, we never needed any "vector DB"... the integration of real data is complex: embeddings get created through platforms like Airflow, metadata lives in a document store, and everything is made available for fast retrieval on low-gravity disk (e.g. built on top of something like RocksDB... then ANN is easy with tools like FAISS).
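A minimal sketch of that retrieval shape, assuming an offline job has already produced the embeddings and ids (FAISS for the index; the file names and the document-store lookup are illustrative stand-ins):

```python
import faiss
import numpy as np

# Assume an offline pipeline (e.g. an Airflow task) has already written these out:
#   embeddings.npy : float32 array of shape (n_docs, dim), one vector per document
#   doc_ids.npy    : array of n_docs document ids, aligned with the vectors
vectors = np.load("embeddings.npy").astype("float32")
doc_ids = np.load("doc_ids.npy", allow_pickle=True)

# Normalize so inner product == cosine similarity, then build the index.
# IndexFlatIP is exact; swap in an IVF index for approximate search at scale.
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def retrieve(query_vec: np.ndarray, k: int = 5):
    """Return the ids of the k nearest documents to a query embedding."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, idx = index.search(q, k)
    # Full text/metadata lives in the document store, keyed by id; fetch it there.
    return [doc_ids[i] for i in idx[0]]
```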
The biggest UX thing: integration into real-life workflows has to be approached in a design-centered way, very carefully. Responsible AI means not blindly throwing these types of things into critical workflows. This drives what I assume is a fairly complicated frontend effort for most (it was for us, especially as we integrate into real existing applications... we started prototyping with a Chrome extension).
EDIT: btw, OSS has definitely made some of these things easier (especially for greenfield & simple product architectures), but I'm skeptical of the new popular kids on the block that are VC-funded & building abstractions you'll likely need to break through across software, data, and inference serving for anything in a scaled + integrated commercial enterprise.
EDIT2: Monitoring, metrics, and analytics architecture were big efforts as well. Ask yourself: how do you measure "value added by AI"?
> we never needed any "vector DB"... the integration of real data is complex: embeddings get created through platforms like Airflow, metadata lives in a document store, and everything is made available for fast retrieval on low-gravity disk (e.g. built on top of something like RocksDB... then ANN is easy with tools like FAISS)
I think the point is they already had those components or familiarity with them in their system hooked into existing data. You could add semantic search with text vectors 3 years ago within 1 hour and so the need for a service isn't there.
They didn't need a second DB running as a new service, holding records that just match back to their existing data by a foreign key.
I have seen it too (as somebody who studied information retrieval, implemented aspects of these systems in production at two startups, and more recently built these semantic search approaches): you can store vectors (floats) of length 384, 512, 768, or 2048, like most current models produce, alongside your existing data without typically making a dent in overall storage, and load them into memory for semantic search according to your needs. (For big data you can use bucketing approaches which don't guarantee the nearest neighbour, only a roughly close one, which is often a trade-off worth making.)
Not only is that not a hassle, it's also a better developer experience overall than the hosted vector DB tools. I suspect at 3 million+ rows you begin to have struggles, but barely any services will get there. And if you do, you can try other techniques (dimensionality reduction to 3x smaller often loses only a tiny portion of the accuracy but saves two-thirds of the cost), or employ cutting-edge embedding techniques like Custom Matrix Optimisation, multi-modal alignment, and others that these services don't necessarily offer.
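To make that concrete, a sketch of the "keep the floats next to your existing rows" approach; all names are illustrative, and it's exact (brute-force) search, which is fine for modest row counts:

```python
import numpy as np

# Vectors stored alongside existing records, e.g. a float32 blob column keyed by
# the record's primary key, loaded into memory at startup.
record_ids = np.array([101, 102, 103])                  # existing primary keys
embeddings = np.random.rand(3, 384).astype("float32")   # e.g. 384-dim embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def semantic_search(query_vec: np.ndarray, k: int = 10):
    """Exact nearest neighbours by cosine similarity over the in-memory matrix."""
    q = query_vec.astype("float32")
    q /= np.linalg.norm(q)
    scores = embeddings @ q
    top = np.argsort(-scores)[:k]
    return [(int(record_ids[i]), float(scores[i])) for i in top]
```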
Sure, it's good for a beginner or a tutorial site, but it doesn't solve a problem for developers in information retrieval, IMO.
You mention being careful when applying these AI tools to critical workflows -- what kinds of workflows have been successfully integrated with AI, and for which has it been more difficult?
Re: monitoring: I am wondering what a monitoring platform (from an ops perspective) would look like for an infrastructure like this, and what are the challenges?
I'm helping build https://portkey.ai/, where we try to take care of storing all prompts, responses, errors, costs, latencies, etc. and making them available for debugging.
Thanks for the link! I skimmed the articles on the blog but couldn't find one describing the monitoring architecture behind it all. Do you have any public references for that?
I'm responsible for multiple LLM apps with hundreds of thousands of DAU total. I have built and am using promptfoo to iterate: https://github.com/promptfoo/promptfoo
My workflow is based on testing: start by defining a set of representative test cases and using them to guide prompting. I tend to prefer programmatic test cases over LLM-based evals, but LLM evals seem popular these days. Then, I create a hypothesis, run an eval, and if the results show improvement I share them with the team. In some of my projects, this is integrated with CI.
The next step is closing the feedback loop and gathering real-world examples for your evals. This can be difficult to do if you respect the privacy of your users, which is why I prefer a local, open-source CLI. You'll have to set up the appropriate opt-ins etc. to gather this data, if at all.
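For the programmatic test cases, a minimal sketch of the idea (a generic harness, not promptfoo itself; `run_prompt` is a hypothetical stand-in for whatever calls your model):

```python
# Hypothetical harness: run each test case through the current prompt/model
# and apply simple programmatic assertions instead of an LLM judge.

def run_prompt(prompt: str) -> str:
    """Stand-in for your model client; replace with a real call."""
    raise NotImplementedError

TEST_CASES = [
    {"input": "Refund request, order #123", "must_contain": "refund"},
    {"input": "Cancel my subscription", "must_contain": "cancel"},
]

def evaluate(template: str) -> float:
    """Return the pass rate for a prompt template; track this across changes."""
    passed = 0
    for case in TEST_CASES:
        output = run_prompt(template.format(input=case["input"]))
        if case["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(TEST_CASES)
```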
We had our customer support vendor come to us showing off a fancy AI feature, free of charge, just for a PoC. Two weeks after we agreed it was running in prod, and two weeks after that we nuked it. Unless people want a big liability in their business, they should read the terms and conditions very carefully.
These tools are not mature enough; no automated decision-making process should leverage them without human intervention, especially when handling other humans. I'm not saying all LLMs are bad. I'm talking about their architectural foundation, which is prone to these problems.
Our measurement was the rate of losing customers compared with them opening tickets complaining about some issue. We had roughly 12x more losses when using the LLM as a replacement for our support team in our A/B test.
Well, yeah. Nobody wants to talk to an LLM for support; they typically want an actual human to help them. I don't think that says much about the state of LLMs -- rather, people who want help want it from someone who can take action, not from an LLM designed to be a long-winded FAQ machine.
I genuinely do not understand why anyone thinks that LLMs for customer service is a good idea.
I want to avoid having to talk to someone from a business at all costs. The only time I (begrudgingly) reach out to customer support is when I need to accomplish something I can't do natively through the service. If an LLM can do the task, then just make it part of the service. Why add an LLM middleman?
I do get that I might be a bit weird and there are folks out there that like to talk to someone. But these people are probably even less likely to care for LLMs than I am, since they value talking to a real person.
Because as much as I'd like a button to press to fix my problem, what that problem is matters to what the LLM or customer service agent chooses to do, and the problem space of things that could possibly be wrong is way larger than any service can hope to cover by programming in a page for every last possibility.
If you’re talking about Intercom (it sounds like it), I thought they took a grossly irresponsible approach that optimized for splashy “we haz bot” rather than really understanding customer support interactions deeply, and looking for where speed or quality could be improved with LLMs. It really shocked me how poorly they approached this.
We have done deep dives with a couple other providers who are being waaaay more insightful and smart about decomposing the problem.
Wow, that's a huge drop in quality. Perhaps it would make sense to use LLMs either as an aid for your customer support team, or as a fallback for when your support team doesn't have the bandwidth to help more customers at the time.
Also, can you go into more detail on the solution? E.g. how did you tailor it to your business (fine-tuning vs. no fine-tuning, etc.)?
I've been meaning to write up a blog post. I've had a lot of success with:
- Using LLMs to summarize, create structure, etc., rather than relying on them to generate responses
- Having my apps rely on receiving LLM output in an exact format (e.g. an OpenAI function call) and raising an error otherwise (see the sketch below)
- Never showing generated outputs directly to a user or saving them directly to the database (avoids a lot of potential for unexpected behavior)
This kind of stuff is less flashy than chatbots for everything, but IMO has a lot more potential to actually impact how software works at scale in the next ~year or two.
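For the second point, a minimal sketch of what "exact format or raise" can look like, assuming the 2023-era chat completions response shape with `functions` (the field names come from that API; the error class is our own):

```python
import json

class LLMFormatError(Exception):
    """Raised when the model's output is not in the exact format we expect."""

def parse_function_call(response: dict, expected_name: str) -> dict:
    """Return the parsed arguments of the expected function call, or raise."""
    message = response["choices"][0]["message"]
    call = message.get("function_call")
    if call is None or call.get("name") != expected_name:
        raise LLMFormatError(f"expected a {expected_name} call, got: {message}")
    try:
        return json.loads(call["arguments"])  # arguments arrive as a JSON string
    except json.JSONDecodeError as exc:
        raise LLMFormatError(f"unparseable arguments: {call['arguments']}") from exc
```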
To add to the first point, creating documentation from bullet points.
Prompting something like “Phrase the following points as technical documentation. Use the Microsoft Technical Writing Style Guide. Don’t add or remove any information” is really good.
Do you publish the points themselves at the top? I'm mystified by the popular desire to have an LLM "expand" their writing. Can't you just publish what you wanted to say?
I use a prompt that says I'm ESL to make it sound more like natural English, and I've found I get fewer questions and better results on written persuasion at work since I started using it.
Building with LLMs does not have to be that different from building on top of any other APIs. (*) The two things that are crucial are:
(1) Observability
Being able to see and store the flow of inputs and outputs for your pipelines. In development, this is invaluable for seeing under the hood and debugging. In production, it's essential for collecting data for eval and training.
(2) Configurability
Being able to quickly swap out and try different configurations -- whether it's "prompt templates", model parameters / providers, etc.
To that end, I have my own internal framework & platform that tackles these problems in a holistic way. Here's a short example:
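A stripped-down sketch of the configurability side (illustrative only, not the actual framework; all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class LLMConfig:
    provider: str          # e.g. "openai" or "anthropic"
    model: str             # e.g. "gpt-4"
    temperature: float
    prompt_template: str   # named, versioned alongside the code

CONFIGS = {
    "summarize-v1": LLMConfig("openai", "gpt-4", 0.2, "Summarize:\n{text}"),
    "summarize-v2": LLMConfig("openai", "gpt-3.5-turbo", 0.0, "TL;DR:\n{text}"),
}

def run(config_name: str, call_model, **inputs) -> str:
    """Render the template, call the model, and log the full exchange."""
    cfg = CONFIGS[config_name]
    prompt = cfg.prompt_template.format(**inputs)
    output = call_model(cfg.provider, cfg.model, cfg.temperature, prompt)
    log_interaction(config_name, prompt, output)  # observability: keep every in/out pair
    return output

def log_interaction(config_name: str, prompt: str, output: str) -> None:
    """Append to whatever store you later mine for evals and training data."""
    print({"config": config_name, "prompt": prompt, "output": output})
```

Swapping "summarize-v1" for "summarize-v2" is then a config change rather than a code change, which is what makes quick comparisons cheap.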
It is an over-hyped product looking for a way to be applied. At its best, an LLM can only ever give you the average answer, which may or may not be the correct answer and probably isn't the best answer. We have enough average.
Sometimes it invents answers, which in our environment we cannot ever tolerate. When this major deficiency is fixed and it can simply say "I don't know," then we can consider working with it. One false positive makes it ineffective as a knowledge tool. It would be analogous to a report from an unknown source that contains bad data: you are better off burning it and starting over with data you can trust than attempting to fix it.
After an initial rush of getting features to prod, at this point I would say we're doing "Evaluation Driven Development". That is, new features are built by building out the evaluations of their results as the starting point.
At least from the people I've talked with, how important evaluations/tests are to your team seems to be the major differentiator between the people rushing out hot garbage and those seriously building products in this space.
More specifically: we're running hundreds/thousands of evaluations across a wide range of scenarios for the prompts we're using. We can very quickly see if there are regressions, and we have a small team of people working on improving this functionality.
Interesting! How would you say you bootstrapped your evaluation system(s)?
This is a great perspective. We managed to bring our error rate (that is, the rate of not returning a valid result) down to about 4% without evaluation-driven prompt engineering, but it involved looking at real-world usage every day, noticing patterns, and doing our best (ugh, this is where evals would have been nice) not to regress things. Combined with some other pieces -- basically letting end users customize definitions, which we parameterize into our prompt -- this seemed to get it very close.
The problem right now seems to be that LLMs are only ~98% predictable. OpenAI functions, for example, invoke properly 90%+ of the time, but every now and again, even with nearly identical input, they will not invoke.
These problems become multiplicative: if you have 3-4 ChatGPT calls chained together, your failure rate is closer to 10%.
Unfortunately, in the health tech space even a 1% failure rate is unacceptable. I have some theories and PoC work I am doing to improve the rate. As with all "AI/ML" projects, it's not as simple as taking a problem and applying AI to solve it.
Exactly! With even just 3 chained calls, a 98% individual chance of success ^ 3 calls ≈ 94.1% joint probability of success, or a 5.9% global failure rate. That's a failure rate of about 1 out of 17.
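The compounding is easy to sanity-check (a quick sketch):

```python
# Joint success probability for a chain of independent calls.
per_call_success = 0.98
for n_calls in (1, 3, 4):
    joint = per_call_success ** n_calls
    print(f"{n_calls} call(s): {joint:.1%} succeed, {1 - joint:.1%} fail")
# 3 calls -> ~94.1% succeed, ~5.9% fail (roughly 1 in 17 chains breaking)
```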
Problems like those and my experience as a reliability statistician are why I built ModelGym: https://modelgymai.com/
I want to help LLM app developers deliver on their visions and accelerate the realization of AI's potential.
It's free to try. If you have any feedback, I'd love to learn of it.
Can you not simply reattempt? Then a 10% failure rate looks more like latency volatility 10% of the time. Of course, 1% of the time you have to reattempt twice, 0.1% of the time three times, etc.
A failure isn't just a 503; it's that instead of producing one of the function calls you specified via OpenAI functions, the model produces a chat message.
In some contexts it may make sense to have calls that can result in a function call, or just a chat message. One approach I am trying is to reduce the number of arguments, another is to reduce the number of function calls available.
Either way, all of this introduces complication to building with OpenAI.
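One mitigation is a bounded retry that treats "chat message instead of function call" as a retryable failure. A sketch, where `chat_completion` is a hypothetical wrapper around your OpenAI call and the response follows the 2023-era chat completions shape:

```python
class NoFunctionCall(Exception):
    """Raised when the model never produced a function call within the retry budget."""

def call_with_retry(chat_completion, messages, functions, max_attempts: int = 3) -> dict:
    """Retry until the model returns a function call instead of a plain chat message."""
    last_message = None
    for _ in range(max_attempts):
        response = chat_completion(messages=messages, functions=functions)
        message = response["choices"][0]["message"]
        if message.get("function_call"):
            return message["function_call"]
        last_message = message  # it answered in prose; try again
    raise NoFunctionCall(f"no function call after {max_attempts} attempts: {last_message}")
```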
Ah, I see -- you don't know a priori whether a text response or a function call is appropriate. Maybe I'm just being cheeky, but what if you had an echo() function and asked it to always put chat responses through the echo function? That way, if it breaks protocol, you know.
I work for a large Fortune 100 company, about 100k employees. Everyone was sent an email from the very top saying that all LLM/AI tools were forbidden without written authorization from Legal and somebody in the C-suite. The basic rationale was that they didn't want to risk anyone using someone else's IP, and they didn't want to risk the possibility of the tools getting hold of our IP.
I'm seeing this a lot. It's a great example of how the older non-FAANG large companies are going to keep making it harder to actually be effective. It's a good opportunity for startups competing with them, though, as the AI tools get better and these companies can't or won't adopt them.
Except, we are a manufacturing company. We deal in designing and manufacturing real products. We sell approximately $5k worth of physical product each second. Maybe a startup with AI tools could compete with us, but I'm not holding my breath.
Cheat Layer has thousands of global users running our agents, and we started building agents over a year ago. We were the first to be approved by OpenAI to sell GPT-3 for code generation, in summer 2021. We're growing MAU 40% MoM and MRR 50% MoM. We won #1 product of the day on Product Hunt on 4/1, and users can pay to access our no-code agents, which use a unique project management interface to build actually useful agents for businesses.
Some improvements we've made:
1) Agents use a global Supabase to store "successful" code from tasks, so new users don't have to go through regenerations to find the same solution.
2) Each user has their own Supabase to grow valuable business data, and our autonomous sales agents can now maintain conversations across email, SMS, and voice calls. This is our defensive moat vs. Microsoft, since they can't hire overnight to copy this data and can't sell it back to customers. It allows agents to implicitly learn things like "high-converting copy," and new users don't have to start from zero when starting a new business. We even had a customer reply to one agent 5+ times: https://twitter.com/CheatLayer/status/1676959310562349059?s=...
3) We designed a new 'guidelines' framework that dynamically constructs the system prompt based on context, which allowed us to push the limits in specific use cases.
5) The most exciting recent update was adding voice synthesis to our autonomous sales agents, who can continue conversations across email, SMS, and now voice. https://www.youtube.com/watch?v=2s4iQ_joToc
6) We set up a testing framework for the guidelines I mentioned above to make sure we constantly iterate. In this way we can also prove the old GPT-4 model is definitely worse, and that some automations don't work at all if you take full advantage of the new model.
You're not missing anything. If anything, that's a great example of why you shouldn't use an LLM to handle your whole sales pitch without supervision: the person being pitched is clearly not interested in the product, asks if they want to make a video with them (probably as a form of paid content), and at one point the LLM starts to reply as if it were the one selling the video...
Awful example to use; in the end the LLM here would have ended up buying a video for $80.
Actually, we've been using automated GPT-4-powered responders in production to get demos on auto-pilot for months, and we have the data to prove it. It's significantly better than Apollo.io, because we actually used it in conjunction with Apollo before we switched to completely personalized outbound.
We have lists of demos and sales closed with this already, which I can happily show anyone who wants to see them over a call.
Look, I don't mind at all when I get a cold email that is thoughtful and well targeted. But LLM written emails are very obvious, and irritate me to no end because I feel like this is the future of marketing – endless waves of automatically written spam, generative voice phone calls, and who knows what else
I think LLMs have a huge potential for changing the way we work, and creating hybrid human/AI interfaces where the AI does the boring parts and the human becomes more productive as a result – but with these agent systems it feels you're aiming to completely replace the human and instead go for quantity over quality
At Salesforce and Experian I worked with some exceptional sales people, who would be very likeable and turn a cold email into a $1m/yr contract. These people would be very smart and leverage their network, background and expertise. Maybe AI can completely replace them one day, but in the meantime I think the real money in this area is building tools that one of these professionals can use to 2x, 3x, 10x their workflow.
We had a sales team who did exactly this before these agents, and they got demos on auto-pilot without touching Apollo.io or Salesforce, because we used GPT-4 to respond the way they would, using their history of responses. You can set this up yourself using the webinar I linked above. We set up a Calendly link with a round-robin scheduler for the whole sales team, and they constantly get demos on auto-pilot.
Parker Harris was actually one of my early mentors, because my first co-founder was the first investor in Salesforce(Halsey Minor). A lot of his advice went into building this.
We're not selling to the CEO, we're selling to the VP of sales who can now a/b test this, or the startup founder who can't afford a sales team yet, and I'm very sure a large percent of reps will fall below the line vs the auto-responders. We increased the number of demos our sales team saw with just GPT-4 auto-responding, then we doubled the total reply rate with personalized outbound. It makes sense that it wouldn't have the exact same reply rate. I'd be happy to give you access to get your insight on the direction we should go if you're interested.
Hi, yes, this is correct. The lead list was 1,500 YouTube partners, to form partnerships and sponsorships, and it actually formed over a dozen deals. You can see the setup here: https://www.youtube.com/watch?v=uj-gH4f6RUM
This agent was built for influencer partnerships to help us growth hack
We're also using it to sell Cheat Layer, and getting 10% reply rates vs the cold static outbound in Apollo.io
Who has deployed their LLM in production, is profitable right now with their LLM startup, has at least $500K+ MRR in less than a year, and is bootstrapped with zero VC financing?
We need to cut through the hype and see whether there is sustainable demand behind the abundance of so-called AI startups that keep reporting they are unprofitable or continuously loss-making for years.
I would wager there aren't a lot here. The company I work for has been live since May with a major LLM feature. It's not a direct contributor to MRR, but folks in the sales org have specifically called it out as helping in the sales process, and we have some data indicating it makes new (free-tier) users more likely to convert to paying customers.
But I don't think we'll see what delivers true business value until next year when a lot more companies have been live for a while.
Langchain is so immature in this area - it doesn't even have a plugin for logging. When we hit production general availability, we realized how bad the assessment/evaluation/observability of this technology is. It is so early, and there are so many weird things about it compared to deterministic software.
What we ended up building looks something like:
- We added a way to log which embedded chunks are referenced by the searches that result from a prompt, so you can see the relationships between prompts and the resulting search results. Unlike a direct query, this is ambiguous. And lots and lots of logging of what's going on: where it looks and, more importantly, where it doesn't look but should. In particular, you want to see if/where it reaches for a custom tool vs. just answering from the core LLM (usually undesirable).
- We built a user facing thumbs up/down with reason mechanic that feeds into a separate table we use to improve results.
- We added a way to trace entire conversations, including the user-facing prompts and responses as well as the many internal logs generated by the agent chain. This way, we can see where in the stepped process it breaks down and improve it.
- We built what amounts to a library for 'prompt testing': basically taking user prompts that lead to hits (that we want to maintain quality for over time) or misses (that failed due to something not being as good as it could be), and running those for each new build. Essentially it's a file per 'capability' (e.g. each agent) containing a bunch of QA prompts and answers (see the sketch below). It doesn't generate the same answer every time, even at temperature 0, so we have to manually eyeball some of the answers for a new build to see whether it's better, worse, or the same. This has enabled a lot, like changing LLMs to see the effects, which wasn't possible without the structured, unit-test-like approach to QA. We could probably have an LLM evaluate whether the answers are worse/better/the same, but we haven't done it yet.
- We look at all the prompt/response pairs that customers are sending through and use an LLM to summarize and assess the quality of the answers. It is remarkably good at this and has made it easier to identify which things are an issue.
- We are using a vector DB which is weird and not ideal. Thankfully, large scale databases are starting to add support for vector type data, which we will move to as soon as we possibly can, because really the only difference is how well it indexes and reads/writes vector type data.
- We have established a dashboard with metrics, including obvious things like how many unique people used it each day and overall message counts, but also less obvious things like a 'resolves within 3 prompts' rate, which is a proxy for whether it gets the job done or whether users constantly have to chase it down. Depending on the problem you're solving, similar metrics will end up evolving.
There are probably other things the team is doing that I'm not even read into yet, as it's such an immature and fast-moving process, but these are the general areas.
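A stripped-down sketch of the 'prompt testing' idea mentioned above (the file layout and the `run_agent` callable are illustrative):

```python
import glob
import json

# One JSON file per capability/agent, e.g. tests/search_agent.json:
#   [{"prompt": "...", "reference": "expected answer or key facts"}, ...]

def run_suite(run_agent, pattern: str = "tests/*.json") -> None:
    """Re-run the recorded prompts against a new build and dump answers for review."""
    for path in glob.glob(pattern):
        with open(path) as f:
            cases = json.load(f)
        for case in cases:
            answer = run_agent(case["prompt"])
            # Output isn't deterministic even at temperature 0, so a human
            # (or eventually an LLM grader) compares answer vs. reference.
            print(json.dumps({"file": path,
                              "prompt": case["prompt"],
                              "reference": case["reference"],
                              "answer": answer}))
```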
Mmm, the LangChain comment is interesting. We never used it, even in prototyping. Then there's so much out there about how it's a giant mess you shouldn't ever let near production. I wonder how many folks are realizing that afterwards and then feeling burned!
> It is so early, and there are so many weird things about it compared to deterministic software.
100% agreed. We have a somewhat similar approach for our product. Part of it feels like it could be abstractable into other environments, but it's so nascent that it's really hard to say for sure. The only thing I'm completely confident in is that you can't debug this stuff with traditional debuggers.
Each of the problems you describe will become a slew of startups and open source projects. Compared to the typical web stack things are really immature and it'll be interesting to see how everything shakes out and fills in.
Gotta be honest, I have no clue what I'm doing. We just pushed to prod and are figuring stuff out as we go: logging all conversations and, because we have no idea what we're doing, just trying different stuff.
For context, I built in some API calls and the LLM asks for the required data. My proof of concept was enough to make the client say "ship it!", and we've been rolling with it since.
I think you're in good company there. Kudos on seizing the opportunity. I built a tool that can help with this. It's free to try and I'd love to learn of any feedback you have: https://modelgymai.com/
OpenAI open sourced their evals framework. You can use it to evaluate different models but also your entire prompt chain setup. https://github.com/openai/evals
Would love to hear feedback and thoughts on how people approach monitoring in production in real world applications in general! It's an area that I think not enough people talk about when operating LLMs.
We spent a lot of time working with various companies on GenAI use cases before LLMs were a thing, and captured the lessons in our library, LangKit. It's designed to be generic and pluggable into many different systems, including langchain: https://github.com/whylabs/langkit/. It goes beyond prompt engineering and aims to provide automated ways to monitor LLMs once deployed. Happy to answer any questions here!
I don't think this differs for LLMs vs. any other ML model in production. Folks working in "MLOps" have been thinking about this for quite a while (and most people don't use langchain in production, so that's at least one thing to put aside; it's fine for prototyping, of course).
We've built effectively a proxy between us and ChatGPT so that we can observe what value users are getting out of it, and to take advantage of more advanced features later. A lot of this work has been speculative, like, we think/hope the information we're collecting will be of use to us, but nobody can articulate precisely how or has a plan for where to start.
For instance, we keep track of edits that users make to generated summaries so that we can later manage fine-tunings specific to users that we might pass along to the model, but I couldn't describe a plan for making use of them today.
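A minimal sketch of that proxy idea, calling the raw HTTP endpoint and writing each exchange to a log (the JSONL file is just a placeholder for a real store):

```python
import json
import time
import requests

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def proxied_chat(api_key: str, payload: dict) -> dict:
    """Forward a chat completion request and record the full exchange."""
    started = time.time()
    resp = requests.post(
        OPENAI_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    record = {
        "request": payload,
        "response": body,
        "latency_s": round(time.time() - started, 3),
    }
    # Placeholder: append to whatever store you mine later (user edits, fine-tuning data, ...).
    with open("llm_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return body
```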
There are several areas where we are seeing LLMs being deployed successfully in production. Often those are in comparatively boring areas, like NLP, classification and intent recognition, for which LLMs work quite well. So far most of this adoption is at companies with existing ML expertise in-house.
We haven’t seen any usage of the more exciting-looking demos of semi-autonomous agents in production yet, and probably won’t for some time.
Productionizing LLMs is a multi-headed chimera: getting the vector DB retrieval in order, proper classification, monitoring, and reliable generations. Still searching for a solution to many of these problems, but https://www.parea.ai helps with the testing, eval, and iteration part. I wonder what folks use for the vector DB retrieval part?
Weights & Biases Prompt tool [1] is useful for easily logging all the interactions with LLMs (supports LangChain + Llamaindex too). It's particularly helpful for debugging since it logs all the information we need to understand and debug.
W&B makes a lot of sense for model training/tuning analytics and tracking, but seems like major overkill for a lot of other logging. Have you tried alternatives or were you using W&B already so you went with it? What is the major logging you're doing (types of info being logged) if you aren't training a model?
https://flowch.ai is live and is a hit with early users. As users supply their own prompts regressions in LLMs aren't really an issue, we're currently using both GPT3.5 and GPT-4, both have their place.
Rapidly moved from demos to people actively using it for their day to day work and automation :)
We're taking advantage of new features as they become available, such as OpenAI's functions and larger context windows. Things are evolving quickly!
I’ve found most advanced LLM are pretty resilient to abbreviations as well. “Tho” can replace “though”, and you can even be as racy as “b4” instead of before. Depending on the model it can start degenerating into abbreviating talk in response for obvious reasons, but for GPT4 it doesn’t seem to be impacted.
It took us quite a bit of work to reduce the static part of our prompt from ~2300 tokens to ~500 tokens. I wish there was a good playbook for this. I used gpt-4 a bit for some ideas, but almost all of it was just getting extremely creative with terse ways to represent info and cut cruft.
Additionally, we've learned that LLMs lose a lot of context in the middle of a prompt, so it might also be that a bunch of data not near the beginning or end of a prompt could just be removed entirely, or at least summarized much more tersely.
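For trimming work like that, it helps to measure as you go. A small sketch using tiktoken (assuming it's installed; the file name is illustrative):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def token_count(text: str) -> int:
    """Count tokens the way the target model's tokenizer does."""
    return len(enc.encode(text))

static_prompt = open("system_prompt.txt").read()
print(f"static prompt: {token_count(static_prompt)} tokens")
# Re-run after each rewrite pass to check whether a rewording actually saved tokens.
```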
Sorry, I've been doing this a bit ad hoc. There are other techniques I use as well, such as choosing shorter synonyms. A structured approach might be a dictionary translation: invert a thesaurus and map each synonym to either itself or a shorter word. There may be some "texting culture" thesaurus out there that you could merge with a standard one.
But it’s also remarkably resilient to dropping single characters or even a few so long as the spelling is very similar. Wish I had a more rigorous approach for you though.
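A rough sketch of that inverted-thesaurus idea (the thesaurus contents here are a tiny stand-in):

```python
import re

# Stand-in thesaurus: each headword mapped to its synonym set (a real one would be far larger).
THESAURUS = {
    "utilize": {"use", "employ", "utilize"},
    "approximately": {"about", "roughly", "approximately"},
}

# Invert it: every synonym maps to the shortest member of its group.
SHORTEN = {}
for word, synonyms in THESAURUS.items():
    group = synonyms | {word}
    shortest = min(group, key=len)
    for s in group:
        SHORTEN[s] = shortest

def compress(text: str) -> str:
    """Replace each word with its shortest known synonym, leaving unknown words alone."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: SHORTEN.get(m.group(0).lower(), m.group(0)),
                  text)

print(compress("We utilize approximately 2300 tokens"))  # -> "We use about 2300 tokens"
```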