I predict there will be another six months of these sorts of articles, accompanied by a raft of LLM-powered features that aren't nearly as transformative as the people currently hyping AI are telling us to expect.
The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.
The main thing LLMs can do is make products accessible/useful to a wider range of users - either by parsing their intents or by translating outputs to their needs.
This might result in a sort of transformation that engineers and power users aren't geared to appreciate. You might look at a natural language log query and say, "that would actually slow me down". But if it makes Honeycomb suddenly useful to stakeholders who couldn't before, it could lead to use cases not on the radar right now.
I probably could have elaborated more on this in the blog post, but you can really distill a lot of Honeycomb's success as a business down to a few things:
- How easily can you query stuff when you're interested
- How easily can you get other people on your team to use the product too
- How quickly can you narrow down a problem (e.g., during an outage) to something you can fix
- How relevant is your alerting (i.e., SLOs) to the success or failure of something business critical
Our bet here is that the first two could potentially be improved by using LLMs, since we hypothesized (and confirmed in some new user interviews) that there's an "expressivity gap" in our product. A lot of people who aren't already observability experts, but do have some vested interest in observability, often know what they want to look for but get confused by a UX that's tailored for people who are more familiar with these kinds of tools.
It's only been 3 weeks so it's too early to tell, but we're seeing some signs that the needle is being moved a bit on some key metrics. We're not betting the farm on this stuff just yet, and it's really cool that there's technology that lets us experiment in this way without having to hire a whole ML engineering team.
Since you're here, I want to say this is one of the most useful posts I've seen about pragmatic development on top of LLMs!
And agreed re: development effort - compared to other hype cycles of AI, it's important for folks to understand that the results they see are coming at a fraction of the experimental budget.
The last chatbot wave (when FB opened up the Messenger API, Microsoft pitched Skype as a plausible bot platform, and Slack rebranded their app store) fizzled out after 18 months.
All to figure out the singular most important thing: chat interfaces are the worst.
I agree that chat interfaces are not great, but we shouldn't reduce LLMs to implementations in chat interfaces. For example, the "autocompletion" I get with copilot is a very useful tool that I use daily, and I think that sort of UX could be built into plenty of other interfaces. Most applications where you input some form of text could benefit from LLM AI.
Yes, this exactly. That's why we didn't go with chat for our UX here, and for future product areas we likely won't either. We already have good UX for our kind of product, and we haven't seen feedback, or been convinced by any other means, that adding chat would help more than it would hurt.
One of the weirdest parts of using Bing Chat is that it has a tab-to-autocomplete function that is almost always wrong about what I want to say. I wish there were an LLM that actually was an "autocorrect on steroids", because that's honestly one of my most-anticipated features of this technology.
Having an LLM spell-checker that would autocorrect my spelling as I typed, based on the context of what I was typing? That would be magnificent.
Yeah, Copilot serves this purpose wonderfully—I've actually started writing documentation straight in VSCode and even occasionally things like certain emails, Jira tickets, or just general notes pertaining to anything vaguely technical, solely because Copilot is quite good at acting as a technical writing assistant.
Since it's just OpenAI's text completion model with a code finetune and without the chat/assistant RLHF, it works much better as an "advanced autocomplete" than ChatGPT or even OpenAI's Turbo model via their API. I can be much more surgical with how I use it (often accepting just a few words at a time), and it's good at following my usual tone.
Can’t agree more. As a user, a chatbot makes me think the company has put some kind of dumb parrot in front of me in order to avoid giving actual support.
This makes me think someone could "pull a Google" with LLMs though. I mean we all know the original pitch for Google, right?
There were search engines. Plenty. And indexes. And ... That wasn't it.
Every site had search. That also wasn't it.
The problem was that these things were incredibly low-quality versions of search; exactly what you're complaining about now with chatbot interfaces. In fact, their low quality is what made Google such an incredible opportunity: centralization came almost built in. They never needed to fight books.com; its search sucked. Even now, finding a book using Google's interface works better than any site's own search, except perhaps Amazon's. And Amazon is "shittifying" its search engine too now ...
Google was also never "the best". It was simply consistent quality that worked everywhere. And because all other search engines were pretty far along in their enshittification cycle, with MBAs unwilling to go back, nobody even made a serious attempt at fighting them. Except, perhaps, and very late to the party, Microsoft.
Google was better, but not incredibly better (and it's been going downhill for like 5 years now). It makes me think that if you could make something that would advise 10% of the world population on how to boil eggs, that would be incredible.
At my org we use a chatbot for pull requests — you get pinged by the bot when the PR is ready to merge, with a button in the chat interface that merges the PR — no need to open GitHub and locate the big green button yourself.
That won’t 10x your productivity or whatever, but it does make it slightly more pleasant.
That does sound cool, but I’m not sure that’s what most folks mean by “chatbot”. My understanding was that a chatbot is an automated chat program that will generate responses to your messages, simulating a live human.
I suspect you're right about how people are using and deploying LLMs now: hacking all kinds of functionality out of a text-completion model that, although it encodes a ton of data and some reasoning, is fundamentally still a text-completion model, and when deployed via today's commercial APIs without fine-tuning, isn't flexible beyond what prompt engineering, chaining, etc. make possible.
But I think we've only scratched the surface as to what LLMs fine-tuned on specific tasks, especially for abstract reasoning over narrow domains, could do.
These applications possibly won't look anything like the chat interfaces that people are getting excited about now, and fine-tuning is not as accessible as prompt engineering. But there's a whole lot more to explore.
I'm an ML engineer and was around long before LLMs. I'm definitely skeptical, and I think I know one dimension where we're missing performance (steerability/controllability without losing performance); a number of people do. It's just something that's quite hard to actually do, tbh.
Those who figure out how to do that well will have quite a legacy under their belts, and money if they're a profit-making company and handle it well.
It's not a question of whether it can be done; it's not hard to lay out the math and prove that pretty trivially if you know where to look. Actually doing it, in a way where the inductive bias translates appropriately to the solution at hand, is the hard part.
The problem is, of course, these systems are fundamentally incapable of human-level intelligence and cognition.
There will be a lot of wasted effort in pursuit of this unreachable goal (with LLM technology), an effort better spent elsewhere, like solving cancer or climate change, and stealing young and naive people’s minds away from these problems.
I don't think I want or need human-level cognition in my software. It turns out what I want is fancy autocomplete that you can few-shot teach to do all manner of useful NLP things, which is what LLMs are giving me today.
Trying to analogize the bullish case for LLMs as I perceive it:
If you had an intern at your beck and call, who was obedient but knew nothing about anything except how to use a search engine, would that make you a "10x" at whatever you do?
Even if we accept your premise, LLMs don't have to make you 10x in order to succeed. I think even 1.05x will do, considering their price point is 20 bucks a month.
Exactly - if I can't trust it (as has been my experience with 90% of queries) and still have to Google its answers to verify them, what good does it do? It isn't really improving my productivity in a substantial way, and can indeed hinder it.
That said, my beef is not with its use for this purpose - whatever makes people happy and feel more productive, it's all good and fun. The problem is when the CEO of a company like OpenAI starts talking with a straight face in his Congressional testimony about how this not-really-new tech has the potential to go rogue and enslave us (I'm heavily paraphrasing, obviously). That is the problem - this particular CEO suddenly feels entitled enough to stage a giant PR stunt by using his official testimony to endow his product with qualities it cannot possibly have. The blatant lies and deception are frankly disgusting.
dinvald said, “The problem is, of course, these systems are fundamentally incapable of human-level intelligence and cognition.
There will be a lot of wasted effort in pursuit of this unreachable goal (with LLM technology), an effort better spent elsewhere, like solving cancer or climate change, and stealing young and naive people’s minds away from these problems.”
In response I said, “While we’re at it let’s encourage people to stop making music and peanut butter as well.”
Then dinvald said, “Your point?”
Can you explain to dinvald what my point was?
ChatGPT:
Absolutely. Your response is a form of reductio ad absurdum, a strategy where one extends an argument to an absurd extreme to point out its flaws. In this case, you are challenging dinvald's claim that efforts towards developing Large Language Models (LLMs) is a waste of resources that could be better spent on solving other problems.
By suggesting we also discourage people from making music and peanut butter, you are demonstrating that dinvald's argument could be applied to any human endeavor that isn't directly tied to solving major global issues. Your point is that not every pursuit needs to be measured against solving the world's largest problems, like cancer or climate change. You imply that just like music or peanut butter, AI development also has its place and value.
The problem is not with the existence of LLMs, it's with how they're marketed - as a "reasoning" engine and something that can "think". There's no cognition in it, and very unreliable information.
If you want to eat a peanut butter sandwich that's marketed as a panacea that can cure all diseases, while not even knowing if the butter has long expired, then all props to you, but I don't.
We've all lamented the disinformation campaigns made by humans recently, well this is like a disinformation engine on atomic steroids! And no, they're not going to "solve it" by feeding it more data and parameters - this problem is inherent in LLMs, and in fact gets harder the more data there is.
Of course, as always, anything that comes out of Silicon Valley is a fad that is shoved down our throats as the best thing since sliced bread. But in reality, it's completely unnecessary, as it only marginally improves the quality of life of the 0.1% of the population, while simultaneously decreasing it for the rest. That is the sad reality we live in, my friend.
This is a very impassioned post, but I'd note, as a researcher who's been working in the field for a while: LLMs achieve, to a degree, an approximation of the basis of the task space that generates the distribution at hand, under a minimum-description-length bound imposed by the L2 penalty. This can be shown reasonably straightforwardly, up to a point.
Hence, as a distilled set of operators in task space, that disambiguation then becomes "reasoning" for a number of people. We're not saying they're human. Under reasonable mathematical constraints you can show reasoning pretty clearly. Heck, even without that you can have good inductive tests.
I'm not sure of the core pith of what you're getting at, but I have a suspicion that there's something else under there beyond the raw mechanics of how LLMs do or don't work. Would that be a correct assertion on my end? :) <3 :D :)
I prefer not to engage further on this topic, to be honest. It's not a productive use of our time.
Instead, to get more educated on it, I'll read Gary Marcus's "Rebooting AI: Building Artificial Intelligence We Can Trust" that discusses research on AI reasoning in more detail ;)
I will probably also look into "Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass" by Mary L. Gray and Siddharth Suri, which discusses already-existing huge ethical violations around building these AI systems by large corporations like Google and Microsoft-backed OpenAI :D <3
If you have any other good reads on this (particularly the 2nd topic) based on your experience, I'd greatly appreciate it!
I think this is more the societal layer of it, which is definitely important. There are a lot of wannabe "Impact of AI" people, so I tend to be generally distrustful of that space due to the huckster coefficient. I think the basic summary is that we can look back to the industrial revolution, only this time applied to white-collar work and with a few things updated to the present; that will land near the general consensus a lot of people will have.
If you're looking for the more technical side of things (reasoning, etc) then I'd recommend taking a look at the original Shannon and Weaver paper, plus information topology from there on out. That field alone is interesting enough to dive down to a Ph.D. in.
Re the societal layer, it's not about the wannabe people though - there are real documented cases of how these big systems were trained, using deeply unethical practices (e.g. outsourcing processing of highly disturbing images to "ghost" workers in Kenya while paying them <$2/hour, by OpenAI). These have been studied by folks with PhDs in the field, so not something to be simply discounted (take a look at https://www.dair-institute.org/publications, for example).
Re reasoning, I'm more curious what you think of the non-statistical school of thought, e.g. the arguments against LLMs as "reasoning" engines, as popularized by Noam Chomsky and Gary Marcus.
> The engineers I know whose job it is to implement LLM features are much more skeptical about the near future than the engineers who are interested in the topic but lack hands-on experience.
This will be entertaining (if dangerous) to watch, as people hopefully become disillusioned with the overselling and overhyping of this not-new-tech-but-wrapped-in-marketing-bs wave of ‘AI’. But, history shows we rarely learn from past mistakes.
I also hope there will be some whistleblower within OpenAI (and others like it) who exposes its internal practices and all of the hypocrisy surrounding it. And, as they say, the fish usually rots from the head.
Is there something specific your second paragraph refers to? I agree with the first one but I don't see a clear basis for the second. Do you just hope they're doing something untoward? And even if there is some dirt on them, how could it possibly relate to the quality of llms overall? Unless gpt4 is really just a room full of smart people typing really quickly...
For a business that lies and deceives the general public from the very beginning about the capabilities of their technology, it can only mean they’re as unethical on the inside as on the outside. If they were truly honest and open, they would not be behaving the way they are. These two sorts of behaviors are simply incompatible with each other.
To be brutally honest, I expected better from Sam, but he has lost all credibility in my eyes based on how they chose to roll out ChatGPT. I now see that he's even more hawkish, daring, and manipulative than Zuckerberg ever was.
I tried building with LLMs, but they have the basic problem of being totally wrong 20-50% of the time. That's very meaningful for most business cases. They do fine when accuracy isn't important, but that's fairly rare outside of writing throwaway content.
That to me feels like there's some prompting improvements that you could do. It could be that your problem is just harder for LLMs than ours, but 20-50% of the time isn't what we observed after several prompting changes. The other thing is that we do regularly get outputs that are "mostly correct" and we can actually correct them manually, so while the model may have a higher fault rate, the actual end-user experience has a lower one.
But you probably didn't use GPT-4, because its API is available only to a select few.
“I tried a known-bad early iteration of a technology that has already been superseded — hence it is bad and will always be bad.” is not a convincing argument.
The first transistor was an ugly and impractical thing too.
I have access to OpenAI's GPT-4, and there appears to be no significant difference from GPT-3 (or GPT-3.5). It looks like we are pretty far into the curve of diminishing returns already.
It's ideal for marketing purposes: it can generate several slogans or posters, and you can pick the one you like most.
However if it needs to generate a graph, and the graph needs to be accurate, you're out of luck.
This is a great summary of why productionizing LLMs is hard. I'm working on a couple LLM products, including one that's in production for >10 million users.
The lack of formal tooling for prompt engineering drives me bonkers, and it compounds the problems outlined in the article around correctness and chaining.
Then there are the hot takes on Twitter from people claiming prompt engineering will soon be obsolete, or people selling blind prompts without any quality metrics. It's surprisingly hard to get LLMs to do _exactly_ what you want.
I'm building an open-source framework for systematically measuring prompt quality [0], inspired by best practices for traditional engineering systems.
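For anyone wondering what that can look like in practice, here is a generic sketch of the idea (not the framework mentioned above): run every prompt variant against a fixed set of test cases and compare pass rates. The prompt variants, test cases, and model name are made-up placeholders, and the call assumes the pre-1.0 OpenAI Python SDK.

```python
# Not the commenter's framework; just a generic illustration of "systematically
# measuring prompt quality": run each prompt variant against fixed test cases
# and compare pass rates. Prompts, cases, and model are made-up placeholders.
import openai

PROMPT_VARIANTS = {
    "v1": "Classify the sentiment of the message as positive, negative, or neutral.",
    "v2": "You are a sentiment classifier. Reply with exactly one word: positive, negative, or neutral.",
}

TEST_CASES = [
    ("the new dashboard is fantastic", "positive"),
    ("deploys have been broken all week", "negative"),
    ("we renamed the service yesterday", "neutral"),
]

def run(prompt: str, message: str) -> str:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "system", "content": prompt},
                  {"role": "user", "content": message}],
    )
    return resp["choices"][0]["message"]["content"].strip().lower()

for name, prompt in PROMPT_VARIANTS.items():
    passed = sum(run(prompt, msg) == expected for msg, expected in TEST_CASES)
    print(f"{name}: {passed}/{len(TEST_CASES)} cases passed")
```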
If you have a task that requires the kind of "exactly" suggested above, then a full LLM is probably not the answer anyway. Try distilling step by step, especially if the goal is to generate a DSL or some restricted language. It can be helpful to give the model a different set of tokens for decoding, such that the only possible outcome is something like 'ATTCGGTCCCGGG' given a question asking it to predict a DNA sequence.
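To make the restricted-token idea concrete, here is a minimal sketch using the `prefix_allowed_tokens_fn` hook in Hugging Face transformers. The base model (plain GPT-2) and the single-character tokenization are illustrative assumptions, not a recommendation from the comment above.

```python
# Minimal sketch: constrain decoding so the model can only emit A/C/G/T (plus
# end-of-sequence). GPT-2 is used purely as a stand-in base model here.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

allowed_ids = [tokenizer.convert_tokens_to_ids(c) for c in ("A", "C", "G", "T")]
allowed_ids.append(tokenizer.eos_token_id)  # allow the model to stop

def only_dna(batch_id, input_ids):
    # Called at every decoding step; these are the only candidate tokens.
    return allowed_ids

prompt = "The primer sequence is: "
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16, prefix_allowed_tokens_fn=only_dna)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```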
For sure. I'm dealing with fuzzier stuff, more in the sense of "don't refer to yourself as a chatbot", "this input should trigger X tool", and things of that nature.
Any thoughts on managing costs? I've been developing against gpt-4, and it runs up charges quickly. I've been thinking I will need to be careful about adding live api calls in any sort of testing situations.
Wondering if your tool has any features to help avoid/minimize wasted api usage?
I could add a couple of things from my own experience. Storing prompts in a database seemed like a good idea, but in practice it was a disaster. Storing the prompt in a python/typescript file, up front at the top, works well. The OpenAI playground, with its ability to export a prompt, works well too; something in gradio running in vscode with debugging mode works even better. Few-shot with refinements works really well. LangChain did not work well for any of my cases; I might go as far as saying that using langchain is bad practice.
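For illustration, a rough sketch of the "prompt at the top of the file, with few-shot examples" pattern; the toy query language and the model choice are made up, and the call assumes the pre-1.0 OpenAI Python SDK.

```python
# prompt.py - the whole prompt lives at the top of the file, few-shot examples
# included, so it's versioned with the code. The toy query language and model
# name are assumptions for illustration.
import openai

SYSTEM_PROMPT = (
    "You translate plain-English questions into log queries. "
    "Respond with only the query, no explanation."
)

FEW_SHOT = [
    {"role": "user", "content": "slowest endpoints in the last hour"},
    {"role": "assistant", "content": "P99(duration_ms) GROUP BY endpoint SINCE 1h"},
    {"role": "user", "content": "error rate by service today"},
    {"role": "assistant", "content": "RATE(errors) GROUP BY service SINCE 24h"},
]

def ask(question: str) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                *FEW_SHOT,
                {"role": "user", "content": question}]
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=messages, temperature=0)
    return resp["choices"][0]["message"]["content"]
```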
It's delightfully hacky, but we actually have our prompt (that we parameterize later) stored in a feature flag right now, with a few variations! I actually can't believe we shipped with that, but hey, it works? Each variation is pulled from a specific version in a separate repo where we iterate on the prompt.
We're going to likely settle on just storing whatever version of the prompt is considered "stable" as a source file, but for now this isn't actively hurting us, as far as we can tell, and there's a lot of prompt engineering left to do.
Do you have a recommendation of how to easily connect a language model to a python repl, apify, bash shell, and composable chaining structures if not langchain? I find those structures invaluable but am curious where else I could build these programs.
The current trend for productionizing LLM-based applications is to write your own (really thin) wrappers around the actual LLM call. The majority of your code should be business logic independent of the LLM, anyway: information retrieval, user interface, response logging, and so on.
In my experience with cloud-based and local models, LLM chaining compounds errors; I would urge you to look at few-shot, single-query interaction models for business applications using LLMs as a new unit of compute.
It’s great for prototyping and seeing what is possible but for running in production you’ll likely need to write it yourself, and it will just take a few minutes.
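As a rough sketch of what such a thin wrapper might look like (the retry policy, logging scheme, and default model are just one possible choice, and the call assumes the pre-1.0 OpenAI Python SDK):

```python
# A minimal "thin wrapper": one function owns the vendor call, retries, and
# logging, so nothing else in the codebase imports the SDK directly.
import logging
import time

import openai

log = logging.getLogger("llm")

def complete(messages, model="gpt-3.5-turbo", retries=3, **kwargs) -> str:
    for attempt in range(retries):
        try:
            start = time.monotonic()
            resp = openai.ChatCompletion.create(model=model, messages=messages, **kwargs)
            log.info("llm call ok in %.1fs", time.monotonic() - start)
            return resp["choices"][0]["message"]["content"]
        except openai.error.OpenAIError as exc:
            log.warning("llm call failed (attempt %d): %s", attempt + 1, exc)
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("LLM call failed after retries")
```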
Could be we're in a (short?) interregnum analogous to pre-Rails Ruby: there are lots of nascent frameworks, but the dominant one hasn't been born yet. FWIW - DIY is working well for me.
The first problem, context window size, is going to bite a lot of people.
The gotcha is that it’s a search problem. The article mentions embeddings and dot product, but that’s the most basic and naive search you can do. Search is information retrieval, and it’s a huge problem space.
You need a proper retriever that you tune for relevance. You should use a search engine for this, and have multiple features and do reranking.
That’s the only way to crack the context window problem. But the good news, is that once you do, things get much much better! You can then apply your search/retriever skills on all kinds of other problems - because search is really the backbone of all AI.
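One hedged sketch of that two-stage idea: BM25 produces cheap keyword candidates, then an embedding model reranks them. The libraries, model, and documents here are only illustrative; a production retriever would add more features and tuning.

```python
# Sketch of "retrieve then rerank" over a toy corpus.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "p99 latency spiked on the checkout service after the 14:05 deploy",
    "how to create a service level objective for error rate",
    "tracing setup guide for Python applications",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    # Stage 1: keyword scores give a candidate pool.
    keyword_scores = bm25.get_scores(query.lower().split())
    candidates = np.argsort(keyword_scores)[::-1][: k * 5]
    # Stage 2: rerank candidates by embedding similarity to the query.
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    reranked = sorted(candidates, key=lambda i: -float(doc_vecs[i] @ q_vec))
    return [docs[i] for i in reranked[:k]]

print(retrieve("why did checkout get slow"))
```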
Such a great comment. The nice thing about this is that if search was a key part of your product pre-LLM, you likely already have something useful in place that requires very little adaptation.
Exactly. LLMs are incredible at information processing, and ok-to-terrible at information retrieval. All LLM applications that rely on accurate information are either infeasible or kick the entire can to the retrieval component.
Yeah, we're definitely learning this. It's actually promising how well a very simple cosine similarity pass on data before sending it to an LLM can do [0]. But as we're learning, each further step towards accuracy is bigger and bigger, and there doesn't appear to be any turnkey solutions you can pay for right now.
The way to think about it is this: you can scan an entire book for the part you’re looking for (a huge context window), or you can look it up in the index in the back (a good retriever). The latter is a better approach when you’re serving production use cases. It’s faster and less expensive.
I would argue a more appropriate title would be something about integrating LLMs into complex products.
A lot of the problems are much more easily solved when you're working on a new product from scratch.
Also, I think LLMs work much better with structured data when you use them as a selector instead of a generator.
Asking an LLM to generate a structured schema is a bad idea. Ask it to pick from a set of pre-defined schemas instead, for example.
You're not using LLMs for their schema generating ability, you're using them for their intelligence and creativity. Don't make them do things they're not good at.
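A small sketch of that "selector, not generator" pattern: the model only returns an index into choices you already trust, so malformed output is trivial to detect and reject. The schema names and model are placeholders, and the call assumes the pre-1.0 OpenAI Python SDK.

```python
import openai

SCHEMAS = ["http_request", "db_query", "background_job", "frontend_event"]

def pick_schema(description: str) -> str:
    menu = "\n".join(f"{i}: {name}" for i, name in enumerate(SCHEMAS))
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer with only the number of the best-matching schema.\n" + menu},
            {"role": "user", "content": description},
        ],
    )
    answer = resp["choices"][0]["message"]["content"].strip()
    if answer.isdigit() and int(answer) < len(SCHEMAS):
        return SCHEMAS[int(answer)]
    raise ValueError(f"model did not pick a valid schema: {answer!r}")
```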
I don't go directly to an LLM asking for structured data, or even a final answer, so you can type literally anything into the entry field and get a useful result.
People are trying to treat them as conversational, but I'd say for most products it'll be rare to ever want more than one response for a given system prompt, and instead you'll want to build a final answer procedurally.
Also wanted to say this is a really cool tool, ty for mentioning it.
I fall into the category of developers using LLMs every single day, for both answering questions while working, and also for more exploratory “bounce ideas off the wall” exercises.
Every time I find a new way to explain to the LLM how I want us to work together, I feel like I've unlocked new abilities and use cases I didn't expect the model to have.
Some examples for those curious:
* I am interested in learning more about X because I want to achieve Y. Please give me an overview of concepts that would be useful to learn about X to achieve Y. [then, after going back and forth fleshing out what I'm interested in learning] Please create a syllabus for me to begin learning about X based on the information you've given me. Provide examples of additional materials I can study which already exist, and some exercises to test and operationalize my knowledge.
* [I find that the above can often make the model attempt to squeeze all the info into a single response, which compresses the fidelity of the knowledge and tends towards big shallow lists, so I employ this trick] I want you to go deeper into each topic you have listed, one at a time. When I say "next", move on to the next topic.
* You are my personal coach for X, here is context about the problem I want to work on and my goals. This is our first coaching session, ask me any questions you need to gather more information, but never more than 3 at once. Where should we start?
Just wanted to say that I checked out your app, and it’s really impressive! When building it, did you bootstrap by asking it what developers like me would want out of a site like that?
It actually came out of my own use of the default ChatGPT interface: I was working on an indie game in my spare time and using it to spitball new mechanics with personas.
But it was really tedious to prompt ChatGPT into being properly critical about an idea that doesn't exist: a basic "make me a persona" prompt will give you an answer, but if you really break down the generation of the persona (i.e., instead of asking for the whole thing, ask who the people likely to use X are, what range of incomes they have, etc.), you get a much better answer.
The site just automates that process and presents chats that are seeded with the result of that process, so the LLM is more willing to imagine things. For example, if a persona complains about a feature, you can hit 'Chat with X' and interrogate them about it; instead of running into 'As an LLM...', you should get an actual answer.
Something we're looking to experiment with is asking the LLM to produce pieces of things that we then construct a query from, rather than ask it to also assemble it. The hypothesis is that it's more likely to produce things we can "work with" that are also "interesting" or "useful" to users.
FWIW we have about a ~7% failure rate (meaning it fails to produce a valid, runnable query) after some work done to correct what we consider correctable outputs. Not terrible, but we think the above idea could help with that.
Based on my personal experience I think that's a much better approach, so I wish you luck with it.
Maybe somewhat counter-intuitively to how most people view LLMs, I strongly believe they're better when you constrain them a bit with some guardrails (e.g., pieces of a query, a bunch of existing queries, etc.).
Happily surprised you guys managed to get it down to only a 7% failure rate though! For how temperamental LLMs are and the seeming complexity of the task that's impressive.
> Happily surprised you guys managed to get it down to only a 7% failure rate though!
Thanks! It, uhh, was quite a bit higher before we did some of that work though, heh. Since we can take a query and attempt to run it, we get good errors for anything that's ill-specified, and we can track it. Ideally we'd address everything with better prompt engineering, but it's certainly quicker to just fix stuff up after the fact when we know how to.
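For readers curious about the shape of that loop, here is a hedged sketch; `generate_query` and `run_query` are placeholders standing in for whatever the product already has, not Honeycomb's actual internals.

```python
# Hedged sketch of a "generate, run, and feed the error back" loop.
from typing import Optional

class QueryError(Exception):
    pass

def generate_query(nl_request: str, error_hint: Optional[str] = None) -> dict:
    """Call the LLM, optionally including the previous error in the prompt."""
    raise NotImplementedError

def run_query(query: dict):
    """Execute against the real query engine; raises QueryError on bad input."""
    raise NotImplementedError

def answer(nl_request: str, max_attempts: int = 2):
    error_hint = None
    for _ in range(max_attempts):
        query = generate_query(nl_request, error_hint)
        try:
            return run_query(query)
        except QueryError as exc:
            # Track the failure for metrics, then retry with the concrete
            # error message as extra context for the model.
            error_hint = str(exc)
    raise QueryError(f"could not produce a runnable query: {error_hint}")
```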
Re: constraints, it turns out that banning tokens in a vocabulary is a great way to force models to be creative and follow syntactic or semantic constraints without errors:
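One concrete way to ban tokens with a hosted API is OpenAI's `logit_bias` parameter, where a bias of -100 effectively forbids a token. A small sketch (the banned word is just an example, token IDs come from `tiktoken`, and the call assumes the pre-1.0 OpenAI Python SDK):

```python
import openai
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
# Ban every token that spells the word "chatbot" (with and without a leading space).
banned = {tok: -100
          for word in ("chatbot", " chatbot", " Chatbot")
          for tok in enc.encode(word)}

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    logit_bias=banned,
    messages=[{"role": "user", "content": "Describe what you are in one sentence."}],
)
print(resp["choices"][0]["message"]["content"])
```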
They cite "two to 15+ seconds" in this blog post for responses. Via the OpenAI API I've been seeing more like 45-60 seconds for responses (using GPT-3.5-turbo or GPT-4 in chat mode). Note, this is using ~3500 tokens total.
I've had to extensively adapt to that latency in the UI of our product. Maybe I should start showing funny messages while the user is waiting (like I've seen porkbun do when you pay for domain names).
Was this in the past week? We had much worse latency this past week compared to the rest (in addition to model unavailability errors), which we attributed to the Microsoft Build conference. One of our customers that uses it a lot is always at the token limit, and their average latency was ~5 seconds, but it was closer to 10 seconds last week.
...also why we can't wait for other vendors to get SOC I/II clearance, and I guess eventually fine-tuning our own model, so we're not stuck with situations like this.
I've seen more errors lately, I think, but no, the latency has been an issue for months. I think it has grown some over the last few months, but not dramatically.
There's no real benefit to streaming if you are planning to use the LLM output downstream (say, in a SQL query). LLM latency is a major annoyance right now, whether locally-hosted or cloud-based.
We had a hack-a-thon at my company around using AI tooling with respect to our products. The topics mentioned in this article are real and come up quickly when trying to make a real product that interfaces with an AI-API.
This was so true that there was an obvious chunk of teams in the hack-a-thon who didn’t even bother doing anything more than a fancy version of asking ChatGPT “where should I go for dinner in Brooklyn?” or straight up failed to even deliver a concept of a product.
Asking a clear question and harvesting accurate results from AI prompts is far more difficult than you might think it would be.
I'd call all of these things specific cases of the general problems we've faced with using neural networks for years. There's a big gap between demo and product. On one hand, OpenAI has built a great product; on the other hand, it's not yet clear if downstream users will be able to do the same.
- Creating, storing, and updating an embedding of a schema that people query against
- Creating an embedding of the user's input
- Running a cosine similarity against the user input embedding and each column in a schema, then sorting by relevancy (it's a score from 1.0 to -1.0)
- Using the top n most "relevant" columns instead of passing the full schema
So far, there are some pros and cons. On the pros side, it's really fast and lets us generally be more accurate for schemas that are very large, since those can get truncated today. We've seen that in some cases it can also help reduce LLM hallucinations. On the cons side, it's another layer of probabilistic behavior and still has a chance of "missing" a relevant column. We can't really say for sure if it's better overall in our test environment, so we're going to just test in production and flag it out if it yields worse results.
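A compressed sketch of that flow, for readers who want to see the moving parts; the column names, the top-n cutoff, and the embedding model are assumptions, and the call uses the pre-1.0 OpenAI Python SDK.

```python
import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

columns = ["duration_ms", "http.status_code", "service.name", "db.statement", "user.id"]
column_vecs = embed(columns)  # in practice: computed once, stored, refreshed on schema change

def relevant_columns(user_input: str, n: int = 3):
    q = embed([user_input])[0]
    # ada-002 vectors are unit length, so a dot product is the cosine similarity.
    scores = column_vecs @ q
    top = np.argsort(scores)[::-1][:n]
    return [columns[i] for i in top]

print(relevant_columns("which services are returning 500s"))
```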
They give the LLM access to a search engine powered by embeddings so that it can pull in relevant info. These embeddings could be created numerous ways: TF-IDF, Word2Vec, BERT, GPT.
Engineering will always be hard, but I think a lot of this current AI hype cycle doesn't even have a product - it's just "well, that's cool, so I want that."
I don’t think that’s very generous - I think it’s “wow that’s amazing I’d like to find a way to integrate it” which I think is perfectly reasonable given it is amazing (even though I think it’s overestimated in its current form, and underestimated due to its current form)
That's the same thing - integrating Google Search in your product was hot at one point but brought no value, people would just use Google instead of your site search anyway.
Working towards having a chat box on every website is not a useful outcome for users or businesses, just OpenAI.
Yes, but that's seeing it as a finished product rather than a demo of a limited technology. The effort here wasn't using it as a chat box, either. The effort here was using natural language to generate non-trivial queries that are semantically aware of the underlying trace system, without the human having to learn the complex query language. I'm surprised if it's hard to see the value of that, and how it's different from a chat box.
When “it could be used for anything” really means only that they haven’t found a market fit and are just a solution in search of a problem, as with most venture-backed enterprises.
I know that at every company I'm working with, they've found pretty useful and interesting fits for this tech. It fills a space that's been impossible to date: an abductive reasoning ability in an abstract semantic space. While they can't actually reason - any more than our inductive or deductive reasoning systems of yore - the missing piece in a lot of stuff has been the ability to navigate an abstract space, find a likely "meaning", and then produce a likely output that's semantically "accurate". The optimizing, constraining, informing via goal-based agency, information retrieval, etc. - those are simply integrations, as this article discusses.

By looking at LLMs as a finished product you miss the magic. It's not a product, it's a capability in a larger system, being displayed in demo ware. The larger systems are where the magic happens. Don't take my word for it: while we will see more hype than we ever have before, we will also see systems that transcend what was possible by amazing leaps and bounds. The jaded are both right and profoundly wrong - as are the wild-eyed dreamers.
Search was already solved - I could find anything I wanted using any number of search engines within seconds, and it provided much more exact information.
LLMs are poorly suited for search, precisely because they can't guarantee to "find" information that actually exists in the real world, yet they do add a lot of unnecessary noise, making it harder to weed out the truth, not easier.