Ask HN: What are some actual use cases of AI Agents right now?
169 points by chenxi9649 9 months ago | 149 comments
There are quite a few startups/OSS projects working on making LLMs do things on your behalf, not just complete your words. These projects range from small atomic actions to web scrapers to more general, ambitious assistants.

That all makes sense to me and I think it's the right direction to be headed. However, it's been a while since the inception of some of these projects/cool demos, and I haven't seen anyone who uses agents as a core, regular part of their workflow.

I'm curious if you use these agents regularly or know someone who does. Or if you're working on one of these, I'd love to know what are some of the hidden challenges to making a useful product with agents? What's the main bottleneck?

Any thoughts are welcome!




> I'd love to know what are some of the hidden challenges to making a useful product with agents?

One thing that is still confusing to me is that we've been building products with machine learning pretty heavily for a decade now, and somehow we've abandoned all that we have learned about the process now that we're building "AI".

The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable.

You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect. This means a product worthy digit classifier needs to be much higher accuracy.

Go look at some of the LLM benchmarks out there, even in these happy cases it's rare to see any LLM getting above 90%. Then consider you want to chain these calls together to create proper agent based workflows. Even with 90% accuracy in each task, chain 3 of these together and you're down to 0.9 x 0.9 x 0.9 = 0.73, 73% accuracy.
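
The compounding is easy to eyeball (a throwaway sketch; the 90% per-step figure is purely illustrative):

    # accuracy of a chain of independent steps, each 90% accurate
    per_step = 0.90
    for n in (3, 12):
        print(f"{n} steps: {per_step ** n:.3f}")
    # 3 steps:  0.729  -> the 3-call agent pipeline above
    # 12 steps: 0.282  -> the 12-digit account number: ~72% chance of at least one wrong digit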

This is by far the biggest obstacle to seeing more useful products built with agents. There are cases where lower accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.


> The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable.

I think that ChatGPT's success might be partly attributable to its chat interface. For whatever reason, a lot of people - including me! - are much more forgiving of inconsistencies, slip-ups, and inaccuracies when in a conversational format. Kind of like how you might forgive a real human for making a mistake in conversation.

I don't think that's necessarily good, and might not have much connection to attempts to build new non-conversational products on top of LLMs, but maybe it has some explanatory power for the current situation.


I don't know if I'm more forgiving of inaccuracies in a conversational interface, but I'm way less likely to notice them in the first place. Especially since the current crop of RLHF'd models are so eager to please that they say nearly everything with high confidence.


I think this is a more realistic notion of what AI danger is rather than the X-risk. It's our tendency to trust things given a certain means of presentation. There are certain things where error is okay, but many things where even a small error is huge (like the OP is mentioning). The danger is not so much in the tool itself but us using the tools in a lazy manner. It isn't unique to ML/AI, but ML/AI uniquely are better at masking the errors. It's why I dislike the AI hype.


Yeah I think a lot of people think RLHF is a tool for increasing accuracy but it really is training it to be convincing.


Which should be rather obvious if you understand that it's basically a GAN. You have a discriminative model whose objective function is based on Justice Potter Stewart's description of porn: I know it when I see it. If you ask what types of errors might emerge from such a formulation, I think it becomes rather obvious.

Which isn't to say that the tools aren't useful. But I have to add this fact because many people conflate any criticism with being dismissive of the technology. But technology is about progress, not completion. Gotta balance criticism and optimism. Optimism drives you and criticism directs you.


> I think that ChatGPT's success might be partly attributable to its chat interface. For whatever reason, a lot of people - including me! - are much more forgiving of inconsistencies, slip-ups, and inaccuracies when in a conversational format. Kind of like how you might forgive a real human for making a mistake in conversation.

The key term here is "conversation". If I query something from the machine and it disappears and rumbles and then prints off something like a 1980s mainframe, with paper that has those holes on the side that you tear off... and then it's wrong, it's wasted time.

Meanwhile with the conversation I'm watching it in real time, and can stop it, refine it, or ask for clarification immediately and effectively. There is an expectation of give and take and "talking through" things to get to an answer, which I find is effective. I don't need it to be 100% right all the time, just 80%, and then I start parsing answers out of it to refine it to 90% accuracy with high confidence.


Personally having been a big fan of GPT-3, I was quite against ChatGPT because of this.

Completion models are obviously wrong very often. Instruct model was kinda ok, but you know it's a dumb machine.

Chat was a bit of an uncanny valley. I treated the instruct model like a child, but chat felt like having a conversation with someone of 80 IQ. It felt frustrating, and you ended up going "no no no, what I meant WAS ..." It felt like dealing with an incompetent colleague.

But I guess there's lots of views on it. Some expected it to be an oracle, even a god. Some treated it like Stack Overflow, then got frustrated that it was giving poor quality answers to poor quality questions. Some were just abusive to it. I suppose it's a mirror in a sense.


Though I wonder how much of that is just that the format doesn’t encourage you to look closely enough at what you’re getting to see if it is right.


There are several reasons to forget:

  - copilots are useful
  - chat is entertaining and useful
  - future tech is coming
  - investment money


This has been a perfect description of my experience doing this. I had written some code to go through reasonably complex web onboarding flows and it basically played out exactly like you predicted in your comment. In addition, I've been working with some vendors that have been trying to do the same thing and they're finding that it works out just like you describe.

The handwritten automations have performed better and the issues are reproducible, so even when there are issues, there's some sense of forward progress as you fix them. With handing it all over to an agent, it really feels like running around in circles.

I think there's probably something here, but it's less trivial than just tossing a webpage at ChatGPT and hoping for the best.


One interesting thing about LLMs is that they can actually recover (and without error loops). You can have a step that doesn't work right, and a later step can use its common-sense knowledge to ignore some of the missing results, conflicting information, etc. One of the problems with developing with LLMs is that the machine will often cover up bugs! You think it's giving sub-par results but actually you've given it conflicting or incomplete instructions.

Another opportunity is that you can have fewer steps or more shared context. One interesting thing about Whisper is that it's not just straight speech recognition but can also be prompted and given context to understand what sort of thing the speech may be about, increasing its accuracy considerably. LLM vision models also do this with things like OCR. This might not help it with the individual digits in an account number, but it does help with distinguishing an account number from a street address on a check.
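
For instance, the open-source Whisper package exposes an initial_prompt parameter for exactly this kind of context (a minimal sketch; the audio file name and prompt text are made up):

    import whisper

    model = whisper.load_model("base")
    # Telling the decoder what the audio is about biases it toward plausible
    # transcriptions of that kind of speech.
    result = model.transcribe(
        "support_call.wav",
        initial_prompt="A customer reads out their account number and street address.",
    )
    print(result["text"])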

Or to take another old-style ML technique, you probably shouldn't be doing sentiment analysis in some pipeline, because you don't need to: instead you should step back and look at the purpose of the sentiment analysis and see if you can connect that purpose directly with the original text.

All that said, you definitely can write pipelines with compounding errors. We haven't collectively learned how to factor problems and engineer these systems with LLMs yet. Among the things I think we have to do is connect the tools more directly with user intention (effectively flattening another error-inducing part of the pipeline), and make the pipelines collaborative with users. This is more complex and distinctly not autonomous, but then hopefully you are addressing a broader problem or doing so in a more complete way.


> You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect.

You are assuming that the probability of failure is independent, which couldn't be further from the truth. If a digit recogniser can recognise one of your "hard" handwritten digits, such as a 4 or a 9, it will likely be able to recognise all of them.

The same happens with AI agents. They are not good at some tasks, but really really food at others.


The "food" typo is just too good to ignore in this context.


And the US Post Office and other postal services have been using this tech to sort letters for several decades now (although postal codes with both letters and numbers like Canada's are harder). It was viewed as the "killer app" for ML in the 1990s.


This thread is an object lesson in the point I'm making: people have forgotten everything we've learned about making ML based products.

Parent comment doesn't understand the concept of expectation, and this comment is apparently unfamiliar with the fact that SotA for digit recognition [0] has been much higher than 90% even in the 90s. 90% accuracy for digit recognition is what you get if you use logistic regression as your model.

My point was that numbers that look good in terms of research often aren't close to good enough for the real world. It's not that 90% works for zip codes, it's that in the 90s accuracy was closer to 99%. You have validated my point rather than rejected it.

0. https://en.wikipedia.org/wiki/MNIST_database


> The biggest thing any ML practitioner realizes when they step out of a research setting is that for most tasks accuracy has to be very high for it to be productizable. You can do handwritten digit recognition with 90% accuracy?

It's way more nuanced than this. Of course, you need decent "accuracy" (not necessarily the metric), but in many business cases you don't need high accuracy. What you do need is a solid process: you can catch errors later, you can cross-reference, you need fail-safes, you need post-mortem error handling, etc.

I shipped stuff (classical ML) that was nothing more than "a biased coin flip," but that still generates value ($) due to the process around it.


Yea that's a good point.

Now I am curious, what are some tasks that can accept a model that is 80% as good as a human, but is 100x cheaper? (Or 100x faster?)


Similar to the sibling comment, helpdesk ticket routing.

The volume of helpdesk tickets large enterprises deal with is very easily and vastly underestimated. If you can route even 30% away from the central triage with 90+% accuracy and drop everything else back to the central triage... you suddenly save 2 FTEs in that spot in some places. And you increase customer satisfaction for most of those tickets because they get resolved faster.
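
A minimal sketch of that kind of triage (the queue names, model name, and fallback rule are all illustrative, not a recommendation):

    from openai import OpenAI

    client = OpenAI()
    QUEUES = ["password-reset", "hardware", "licensing", "other"]

    def route(ticket_text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any cheap chat model
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Classify the helpdesk ticket into exactly one of: "
                            + ", ".join(QUEUES) + ". Reply with the queue name only."},
                {"role": "user", "content": ticket_text},
            ],
        )
        queue = resp.choices[0].message.content.strip().lower()
        # anything unrecognized drops back to central triage
        return queue if queue in QUEUES else "other"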

Or, as much as people hate it, chatbots as a customer front. Yes, everyone here as an expert in a lot of tech has had terrible experiences with chatbots. Please mark your hate with the word "Lemon" in the comments. But decently implemented chatbots with a few systems behind them can resolve staggering amounts of simple problems from non-techies without human interaction from the company deploying them. It remains important to eventually escalate to humans - including the history from all of these interactions to avoid frustrations, sure.

Or, ticket/request preprocessing. Remember how spelling out that 10-digit account number to a hard-of-hearing call center agent sucks? Those 4 retries because you didn't use a better way to communicate that number also cost the company. Now you can push a few of these retries into an AI system. If you email them, an AI system can try to extract information like account numbers, intent, start of the problem, and problem descriptions into dedicated fields to make the support agents faster.

Companies are certainly overdoing it at the moment, I'm not denying that. But a lot of the support/helpdesk pre-screening can be automated with current AI/ML capabilities very decently. Especially if you learn to recognize and navigate it.


and this is why consumers lose a bunch of money, so corporations can save $3-6 per help desk ticket. then consumers get stuck in a bot interface and can never break out of it (speaking from personal experience)


The last company I worked at eventually became a Big Tech. In the beginning though, we used to ask all engineers to pair with customer service folks to deal with ticket triage. When we got a bit bigger, we used to have rotations where eng would pair with customer service folks. Being on the other side of that was very eye-opening for all eng. Many used to come in with the same bias that you see on this site, that how dare you be routed to some automated service and how inhumane the service is. On the other side you see competent CS agents absolutely swamped with low-level questions that were often literally answered in docs and FAQ pages. I think getting transformer-based triage models correct can unlock tons of value.


Hard agree - I've heard frankly staggering "per support ticket" costs from every company I've worked for or that has publicly talked about customer support costs. Think $3-6 per customer support ticket.

A UK company, Octopus, has been doing some interesting work on GenAI <> customer support, which is helped by their "Energy provider in a box" software called Kraken (https://octopusenergy.group/kraken-technologies), which gives a single unified view over their operations.

They even have support-agent-level personalisation - i.e., the agent will talk in the tone of voice of a given agent via fine-tuning on their chat history.


If you group tickets by root cause, the projected future cost will often fully fund fixing the issue. Most companies, however, look at customer support as a cost center instead of the valuable insight it is.


I taught customer service / software engineers to process tickets from the singular queue and eliminated routing. Worked surprisingly well.

I have yet to see a chatbot in a customer service function that isn't strictly worse than a button. Usually, the button is "request refund / return" for whatever reason. It's like captcha stuff, web site owner is too dumb (no offense) to figure out how to handle spam so they offload that task onto the customer.


Lemon


Well, an old one is OCR, especially handwritten OCR. I'm doing genealogy. There is SO MUCH old handwritten material that is never transcribed, and which requires special expertise to read (old and exotic handwriting styles) and interpret (place names, writing conventions, abbreviations).

It doesn't have to be perfect. It's not as if the actual data in there is perfect. It just has to be in a form where I can search it, ideally with named entities mapped.

Quality - like deciphering the writing on scrolls buried in volcanic ash in Herculaneum - gets all the attention. But what I really want is quantity - I want to be able to search through those 5000 pages of 200 year old mildly damaged cadastral records in dense handwriting. I want to relieve the army of kind retirees who currently transcribe these sorts of documents one by one based on their own needs.


bruh what are the retirees supposed to do once you've automated their hobby


A ton of tasks. Call centers to start with (they already do[1]), with human fallback.

1: In my country, after ChatGPT launched last year, when you call customer support you are now prompted to “just say in a few words” what you want instead of going through tap-this-number menus (they exist as a fallback) and I believe the backend is an LLM. The user flow and voice recordings are still programmatically determined though, but I can easily see one streamlined model calling APIs and whatnot, handling it all.


I speak English clearly. These things always tell me to repeat what I said. Never once has it ever worked for me. I want to throw my phone at the wall.

Also I think this has been around for longer than chatgpt. It is often accompanied by a fake keyboard clicking noise.


> I speak English clearly. These things always tell me to repeat what I said. Never once has it ever worked for me. I want to throw my phone at the wall

Now imagine how well it works for people with non-"native" accents (even for native speakers, I'd guess that a good Scouse/Glaswegian/Kiwi accent might confuse the hell out of those systems as well). It's a disaster and I hate those.


There's a general problem across the tech industry with replacing existing simple and reliable (in a sense of conveying the user's intention) interfaces like physical buttons with stuff that is supposedly "more natural" like speech recognition or swipe gestures that in practice has a much higher error rate.

See also: replacing physical buttons with convoluted swipe gestures on mobile devices in the never-ending quest to make the screen as large as possible. When was this ever a user ask?

I feel sometimes like the present UX design is one large LLM-like hallucination.


Yea that's pretty cool too, I heard some restaurants are also doing a 100% voice LLM to take orders.

Transcription, specifically Whisper, is one of those ML models where the accuracy is basically on par with humans. So I really expect a lot more to come out of real-time voice/LLM integrations. (The ChatGPT voice thing is a good glimpse, but very janky.)


"A firm providing AI drive-thru tech to fast food chains actually relies on human workers to take orders 70% of the time": https://www.businessinsider.com/ai-drive-thru-tech-relies-on...


Scan a menu, look for the different entrees, identify the most probable ingredients, determine health content. Then: allow people to search for food based on allergies, food aversions, calories. Generate pictures of what the food might look like, and display the pics next to the food to make it more likely a user will buy that food.


Until one of your customers' children eats a peanut that the AI didn't infer would be an ingredient, and dies.

Generating fake pictures also seems like it would be more ordinary false advertising.


also requires the resto or manufacturer to list all ingredients, and most already list potential allergens.

but yeah liability would scare me, esp. because without it you can put the liability squarely on the restaurant or in some cases the person ordering / asking / buying.


I'm not sure this argument is in any way specific to LLMs, and the space for their application is still enormous. Search results, ad targeting, recommendation systems, anomaly detection, content flagging, and so on, are all systems using machine learning with a high false positive rate.

Up until fairly recently many systems used non-LLM models for making decisions based on natural language. Their performance would have been far worse but they still did useful work. Examples would include content policy enforcement, semantic search and so on.

There are very many cases where a system will make an automated decision on a heuristic or random basis for lack of better options. ML improved those decision points and spawned new ones. LLMs improve a subset of those decision points and spawn new ones.


The last widely adopted AI technology before LLMs was facial recognition, which is used in fields such as company clock-ins, access control, surveillance, and more, and it is so trusted that it is often the sole method for clocking in. These facial recognition systems maintain an extremely high accuracy rate for every entry and exit of thousands of people in a database every day. So when will LLMs achieve such accuracy?


They have… they write language and they are good at it. The problem is language is not reality, proper language does not mean truth or fact. The models predict what is most likely to come next, not what is most likely to reflect reality.


> what is most likely to come next

> not what is most likely to reflect reality.

Shouldn't there be a strong statistical correlation between the two? And, isn't that, fundamentally, more about intent of the training? If I train a model to predict what comes next in reality, it's through next word prediction, but it is predicting what reflects reality the best.


You're 100% right - but I do think there are more lower accuracy cases than I initially expected, *especially* if you assume a human-in-the-loop. Still 10x better than status quo.

Ex. Content generation + zero-shot classification/mapping are powerful, and with a human in the loop (somewhat) responsible for accuracy, they can move much faster.


> There are cases where lower accuracy results are acceptable, but most people don't even consider this before embarking on their journey to build an AI product/agent.

what do you think would help people consider this before going down that path?


I already use a few AI tools even without perfect accuracy.

And an LLM that only needs to make a few API calls isn't hard.

Very little needs perfect accuracy, and for that we still have classical software.


You use them successfully because your human mind can filter out the junk. It would only take one inaccurate API call that charges your credit card $10k or sells your car for 10 cents to cause a lot of damage to your life.


Which is why even with classical software, most of us don't have APIs where that's all it takes.


Frankly I think you've said it all here - a properly designed API + a well designed LLM interface on top of that enables non-technical people to do things they otherwise couldn't.


curious to know, which tools do you use and how do you use em?


I use copilot for coding.

ChatGPT for writing emails to bigger audiences, grammar correction.

Image generator for fun.

LLM for Feature Extraction from random text like a website or PDF (llama).


The first thing any ML practitioner realizes is that accuracy is about the single worst performance metric you can use for most real-world tasks, lol


Can you explain this? Why is that?


Let me sell you an amazing cancer test my friend. It's 99.5 percent accurate.

It works through an incredibly novel mathematical technique. You simply go to the patient and tell them they don't have cancer.

Since 99.5 percent of people don't have cancer, this classifier is 99.5 percent accurate.

Completely bullshit classifier for a completely bullshit metric.

Use sensitivity/specificity or precision/recall instead.
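
The numbers are easy to reproduce (a toy sketch; the 0.5% prevalence is the assumption doing all the work):

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # 1000 patients, 5 with cancer; the "classifier" always says "no cancer"
    y_true = [1] * 5 + [0] * 995
    y_pred = [0] * 1000

    print(accuracy_score(y_true, y_pred))                    # 0.995 -- looks amazing
    print(recall_score(y_true, y_pred))                      # 0.0   -- misses every case
    print(precision_score(y_true, y_pred, zero_division=0))  # 0.0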


> You can do handwritten digit recognition with 90% accuracy? Sounds pretty good, but if you need to turn that into recognizing a 12 digit account number you now have a 70% chance of getting at least one digit incorrect. This means a product worthy digit classifier needs to be much higher accuracy.

Language is essential for human civilization, so are tools. We wouldn't get far without either.

Maybe a language model can understand what it needs to do but not how to do it, so you give it a tool.

Humans can get pretty far without 100 percent accuracy, and we can get a lot from AI models before they reach 100 percent; and given that at some point AI will be able to improve itself, even remake itself daily with 2x the abilities, 100 percent, or at least 99.7 percent, is attainable.

Right now I can take any YouTube video summarize it and turn it into a podcast, short form videos, and a blog post.

There are definitely a lot of marketing uses right now for AI agents. If you think about embodied AI, it's only as good as its body: if it doesn't have good grippers it will struggle to pick things up.

Also, with a lot of things, accuracy is subjective: one person might think ad copy is great and maybe their manager thinks it's shit. One person could give it a 100 percent score and another a 70 percent.

My point is we're so close here, and it's already amazing technology and we can augment failures by creating larger toolboxes.


None of these I've seen actually works in practice. Having used LLMs for software development the past year or so, even the latest GPT-4/Gemini doesn't produce anything I can drop in and have it work. I've got to go back and forth with the LLM to get anything useful and even then have to substantially modify it. I really hope there are some big advancements soon and this doesn't just collapse into another AI winter, but I can easily see this happening.

Some recent actual use cases for me where an agent would NOT be able to help me although I really wish it would:

1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer. Not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. Tried a couple LLMs, none even came close. And these are pretty simple basic designs! I had to do it all manually.

2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even at a very detailed outline level, and with a large context window, kept forgetting key points, repetitive language, no plot development. I just have to write the story myself.

3. Illustrator Agent - Image generation for the above story. Images end up very "LLM" looking, often miss key elements in the story, but one thing is worst of all: no persistent characters. This is already a big problem with text, but an even bigger problem with images. Every image for the same story has a character who looks different, but I want them to be the same.

4. Publisher Agent - Package the things above together so I can get a complete package of illustrated stories on topics, available on web/mobile for viewing, tracking progress, at varying levels.

Just some examples of where LLMs are currently not moving the needle much if at all.


>even the latest GPT-4/Gemini doesn't produce anything I can drop in and have it work

This is certainly true for more complex code generation. But there are a lot of "rote" work that I do use GPT to generate, and I feel like those have really improved my productivity.

The other use case for AI-assisted coding is that it _really_ helps me learn certain stuff. Whether it's a new language, or code that someone else wrote. Often times I know what I want done, but I don't know the corresponding utility functions in that language, and AI will not only be able to generate it for me but also through the process teach me about the existence of those things.(some of which are wrong lol, but it's correct enough for me to keep that behavior)


> 2. Story Generator Agent - Write a story from a given outline (for educational purposes). Even at a very detailed outline level, and with a large context window, kept forgetting key points, repetitive language, no plot development. I just have to write the story myself.

You have to break it down into smaller steps and provide way more detail than you think you do in the context. I did an experiment in story generation where I had "authors" that would write only from the perspective of one of the characters that was also completely generated starting first from genre, name, character traits, etc. Then for a given scene, within a given plot and where in the story you are, randomly rotate between authors for each generation, appending it in memory, but not all of the story fits in context. And each generation is only a couple hundred tokens where you ask it to start/continue/end the story. The context contains all of this information in a simple key:value format. And essentially treat the LLM like a loom and spin the story out.

Usually what it produces isn't quite the best, but that's okay, because you can further refine the generation by using different system/user prompts explicitly for editing the content. I found that asking it to suggest one refinement and phrase it as a direct command, then feeding that command with the original generation, works. This meta-prompting tends to produce changes that subjectively improve the text according to whatever dimensions specified in the system prompt.

If you treat the composition as way more mechanical with tightly constrained generation, you get a much better, much more controlled result.
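
Roughly this shape, if it helps (a sketch of my reading of the approach, not the parent's actual code; the model name, prompts, and context format are placeholders):

    import random
    from openai import OpenAI

    client = OpenAI()

    def generate(system: str, user: str) -> str:
        r = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any chat model
            max_tokens=300,       # "a couple hundred tokens" per step
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}])
        return r.choices[0].message.content

    def spin_story(authors: list[dict], plot: dict, scenes: list[str]) -> str:
        story: list[str] = []
        for scene in scenes:
            author = random.choice(authors)  # rotate perspective each generation
            context = (f"genre: {plot['genre']}\nplot: {plot['summary']}\nscene: {scene}\n"
                       f"narrator: {author['name']}\ntraits: {author['traits']}\n"
                       f"story so far (tail): {' '.join(story[-2:])}")
            chunk = generate("Continue the story strictly from this narrator's perspective.", context)
            # meta-prompting pass: one refinement phrased as a direct command, then apply it
            command = generate("Suggest exactly one improvement, phrased as a direct command.", chunk)
            chunk = generate(f"Rewrite the text, following this instruction: {command}", chunk)
            story.append(chunk)
        return "\n\n".join(story)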


> 1. An agent to automate generating web pages from design images - Given an image, produce the HTML and CSS. LLMs couldn't do this for my simple page from a web designer. Not even close, even mixing up vertical/horizontal flex arrangement. When I cropped the image to just a small section, it still couldn't do it. Tried a couple LLMs, none even came close. And these are pretty simple basic designs! I had to do it all manually.

That’s because none of the models have been trained on this. Create a dataset for this and train a model to do it and it will be able to do it.


https://www.youtube.com/watch?v=bRFLE9qi3t8

Here's the CEO of Builder.io supporting your comment: he says they tried LLMs/agents and it didn't work. Then they collected a dataset and developed an in-house model, used only to assist where they couldn't solve things with imperative programming.


Not really, he's saying that the solution is to not have the entire process in a single model, it's better to have the model work on specific pieces that you broke down, rather than feeding the whole thing and expecting the model to be able to break it down and generate correctly by itself.


One area that has been useful for me, is writing simple code in languages I am not familiar with, and not willing to learn. For example, I needed to write a small bash script to automate things in Ubuntu, it really saved me time on googling all those commands. Same with Task Scheduler XML language. It knows very well the popular use cases of all the languages.


Besides writing boilerplate, I used AI to generate a color scheme and imagery for a charity website I built.


Why do you want it to generate web pages from images? I'm having trouble understanding the workflow here. You see a component you like on another website and want to obtain the code from it? Or if you have a design already, why not just use a Figma to Code tool?


It's not that uncommon to have a workflow where the webpage design gets built and negotiated with stakeholders/customers as a series of photoshop images, and when they're approved, it's forwarded to developers to make a pixel-perfect implementation of that design in HTML/CSS.


Say you draw up your rough vision of things on paper, a very simple mock-up. That could be a nice use case.


I taught https://github.com/KillianLucas/open-interpreter how to use https://github.com/ferrislucas/promptr

Then I asked it to add a test suite to a rails side project. It created missing factories, corrected a broken test database configuration, and wrote tests for the classes and controllers that I asked it to.

I didn't have to get involved with mundane details. I did have to intervene here and there, but not much. The tests aren't the best in the world, but IMO they're adding value by at least covering the happy path. They're not as good as an experienced person would write.

I did spend a non-trivial amount of time fiddling with the prompts I used to teach OI about Promptr as well as the prompts I used to get it to successfully create the test suite.

The total cost was around $11 using GPT4 turbo.

I think in this case it was a fun experiment. I think in the future, this type of tooling will be ubiquitous.


This is pretty cool!

Another use case where the cost of being slightly worse than a human is totally fine. (Coming from someone that doesn't write tests, lol.)

I'd love to learn in more detail how it created those factories and corrected the broken test database configuration. It _feels_ like some of these tasks require knowing different parts of the codebase decently well, which from my experience hasn't always been the strong suit of AI-assisted coding.


OI fixed the factories and config by attempting to run the tests. The test run would fail because there's no test suite configured, so OI inspected the Gemfile using `cat`. Then it used Promptr with a prompt like "add the rspec gem to Gemfile". Then OI tries again and again - addressing each error as encountered until the test suite was up and running.

In the case of generating unit tests using Promptr, I have an "include" file that I include from every prompt. The "include" file is specific to the project that I'm using Promptr in. It says something like "This is a rails 7 app that serves as an API for an SPA front end. Use rspec for tests. etc. etc."

Somewhere in that "include" file there is a summary of the main entities of the codebase, so that every request has a general understanding of the main concepts that the codebase is dealing with. In the case of the rspec tests that it generated, I included the relevant files in the prompt by including the path to the files in the prompt I give to Promptr.

For example, if a test is for the Book model then I mention book.rb in the prompt. Perhaps Book uses some services in app/services - if that's relevant for the task then I'll include a glob of files using a command line argument - something like `promptr -p prompt.liquid app/services/book*.rb` where prompt.liquid has my prompt mentioning book.rb

You have to know what to include in the prompts and don't be shy about stuffing them full of files. It works until it doesn't, but I've been surprised at how well it works in a lot of cases.


What do you mean when you use the word taught for open-interpreter?

Looking at the OI docs wasn't too helpful.

"I did spend a non-trivial amount of time fiddling with the prompts" was it writing prompts?

I am really interested and this seems like a cool use case that I want to explore. Could you share the prompts on a github gist?


Here's the fork of Open Interpreter that I was experimenting with: https://github.com/ferrislucas/open-interpreter/pull/1/files

The system prompt that adds the Promptr CLI tool is here: https://github.com/ferrislucas/open-interpreter/pull/1/files...


I think I have the prompts still, but not on my work machine. I'll look tonight and edit this comment with whatever I can find.

I actually forked OI and baked in a prompt that was something like "Promptr is a CLI etc. etc., give Promptr conceptual instructions to make codebase and configuration changes". I think I put this in the system message that OI uses on every request to the OpenAI API.

Once I had OI using Promptr then I worked on a prompt for OI that was something like "create a test suite for the rails in ~/rails-app - use rspec, use this or that dependency, etc.".

Thanks for your interest! I'll try to add more details later.


We're using AI agents for the orchestration of our fully automated web scrapers. But instead of trying to have one large general purpose agent that is hard to control and test, we use many smaller agents that basically just pick the right strategy for a specific sub-task in our workflows. In our case, an agent is a medium-sized LLM prompt that has a) context and b) a set of functions available to call.

For example we use it for:

- Website Loading: Automate proxy and browser selection to load sites effectively. Start with the cheapest and simplest way of extracting data, which is fetching the site without any JS or an actual browser. If that doesn't work, the agent tries to load the site with a browser and a simple proxy, and so on (there's a rough sketch of this escalation after the list).

- Navigation: Detect navigation elements and handle actions like pagination or infinite scroll automatically.

- Network Analysis: Identify desired data within network calls.

- Validation: Hallucination checks and verification that the data is actually on the website and in the right format. (this is mostly traditional code though)

- Data transformation: Clean and map the data into the desired format. Finetuned small and performant LLMs are great at this task with a high reliability.
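
Here's the escalation idea for the loading step, stripped of everything agent-y (a rough sketch; the content check and the Playwright tier are just one way to do it):

    import requests
    from playwright.sync_api import sync_playwright

    def looks_like_real_content(html: str) -> bool:
        # crude placeholder check; deciding this reliably is the hard part
        return len(html) > 2048 and "captcha" not in html.lower()

    def fetch_plain(url: str):
        # cheapest tier: plain HTTP, no JS, no real browser
        r = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        return r.text if r.ok else None

    def fetch_browser(url: str, proxy: str | None = None):
        # next tier: a real headless browser, optionally behind a proxy
        with sync_playwright() as p:
            browser = p.chromium.launch(proxy={"server": proxy} if proxy else None)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()
            browser.close()
            return html

    def load_site(url: str):
        for strategy in (fetch_plain, fetch_browser):  # escalate only when needed
            html = strategy(url)
            if html and looks_like_real_content(html):
                return html
        return None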

The main challenge:

We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

The integration of tightly constrained agents with traditional engineering methods effectively solved this issue for us.

Edit: You can try out a simplified version of this in our playground: https://www.kadoa.com/add


I am confused to where this leaves us. Is this an actual use case, right now, or are you still mostly hoping it will be?


We're actively using this approach at scale, although still improving :) You can try out a simplified version of this in our playground: https://www.kadoa.com/add


Gave this a go. Just so happened that I had the page of an eBay seller open. Wondered if it could manage to do something as simple as extracting all 240 listed products on that page. Instead of determining that the most important data on this page would be the products, it identified these properties: categoryName, subCategories, link.


Yeah, I tried with a type of website that I commonly write scrapers for and I'm not sure if I can do anything with these results.

AI + web scraping is hard; I've tried and gave up, but that doesn't mean it's impossible, it just means I'm not a good engineer, so I will stay tuned to the Kadoa project.


Absolutely not knocking this project. Was just a somewhat unexpected result from such a simple site. Asked GPT-4 to write a scraper just to compare and it produced a quite usable boilerplate.


I'll look into this, we have a specific workflow template for ecommerce as it's always a similar data schema.


I hadn’t heard of Kadoa before. Thanks for sharing it sounds like an interesting problem to solve.


On a related note, I recently learned about the got-scraping module, which doesn't use Chromium or any browser but is good at mimicking a browser and executes JavaScript. I also wrote a module that parallelizes browserless.io / playwright and makes it really cheap to use a cloud scraping solution.


Where/how do you host your finetuned small models? I have some use-cases, but that headache always leads me right back to OpenAI.


I'm working on research agents to help with economic, financial, and political research. These agents are open source (see: https://github.com/wgryc/emerging-trajectories).

The use cases are pretty straight forward and low risk:

1. Run a Google web search.

2. Query a news API.

3. Write a document based on the above, while citing sources.

Here's an example of something written yesterday, where I'm forecasting whether July 2024 will be the hottest on record: https://emergingtrajectories.com/a/forecast/74

This is working well in that the writeups are great and there are some "aha" moments, like the agent finding and referencing the National Snow and Ice Data Center (NSIDC)... Very cool! I wouldn't have thought of it.

Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all.

So, YMMV, as they say... But I am more productive with these agents. I wouldn't publish anything formally without confirming and reviewing the content, though.


> Then there's the part where the agent also tells me that the Oregon Department of Transportation has holidays during the summer, which doesn't matter at all

I guess the agent was influenced by results reported by the Oregon Dept of Transportation: if they were all out on holidays and not releasing their weather info, it would impact the proxy being used to determine whether the temperature is higher.

For me much of my interest in LLMs is these unexpected associations.


The company I work for has tons of documentation and regulations for several areas. In some areas there are well over a thousand documents, and for ease of use we build RAG-based chat bots on top of them. This is why I have been playing with RAG systems on the scale of "build completely from scratch" to "connect the services in Azure". The retrieval part of a RAG is vital for good/reliable answers, and if you build it naively, the results are not overwhelming.

You can improve on the retrieved documents in many ways, like:

- better chunking,

- better embedding,

- embedding several rephrased versions of the query,

- embedding a hypothetical answer to the prompt,

- hybrid retrieval (vector similarity + keyword/tfidf/bm25 related search; a small sketch of this follows the list),

- massively incorporating meta data,

- introducing additional (or hierarchical) summaries of the documents,

- returning not only the chunks but also adjacent text,

- re-ranking the candidate documents,

- fine tuning the LLM and much, much more.
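
For the hybrid retrieval point, the core of it fits in a few lines (a sketch; the embedding model, tokenization, and RRF constant are generic illustrative defaults):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    docs = ["chunked regulation text ...", "another chunk ...", "a third chunk ..."]

    bm25 = BM25Okapi([d.lower().split() for d in docs])   # keyword side
    encoder = SentenceTransformer("all-MiniLM-L6-v2")     # vector side
    doc_emb = encoder.encode(docs, convert_to_tensor=True)

    def hybrid_search(query: str, k: int = 5, rrf_k: int = 60):
        kw_scores = bm25.get_scores(query.lower().split())
        kw_rank = sorted(range(len(docs)), key=lambda i: -kw_scores[i])
        vec_scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True), doc_emb)[0]
        vec_rank = sorted(range(len(docs)), key=lambda i: -float(vec_scores[i]))
        # reciprocal rank fusion: a chunk ranked well by either signal floats to the top
        fused = {i: 1 / (rrf_k + kw_rank.index(i)) + 1 / (rrf_k + vec_rank.index(i))
                 for i in range(len(docs))}
        return [docs[i] for i in sorted(fused, key=fused.get, reverse=True)[:k]]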

However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:

- "What are the key differences between the new and the old version of document X?"

- "Which documents can I ask you questions about?"

- "How do the regulations differ between case A and case B?"

In these cases it is really helpful to incorporate LLMs to decide how to process the prompt. This can be something simple like query-routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan on how to combine the partial answers. You can also build a network of agents with different roles (like coordinator/planner, reviewer, retriever, ...) to come up with an answer.

* edited the formatting


> You can also build a network of agents

My experience has been that they are far too unpredictable to be of use.

In my testing with agent networks, it was a challenge to force it to provide a response, even if it was imperfect. So if there's a "reviewer" in the pool, it seemed to cause the cycle to keep going with no clear way of forcing it to break out.

3.5 actually worked better than 4 because it ran out of context sooner.

I am certain that I could have tuned it to get it to work, but at the end of the day, it felt like it was easier and more deterministic to do a few steps of old-fashioned data processing and then handing the data to the LLM.


That is an interesting observation. I have not gotten to the point of too long cycles and I can think of two reasons for that.

Maybe my use case is narrow enough, so that in combination with a rather constraining and strict system message an answer is easy to find.

Second, I have lately played a lot with locally running LLMs. Their answers often break the formatting required for the agent to automatically proceed. So maybe I just don't see spiraling into oblivion, because I run into errors early ;)


The use case we have is that we are asking the LLM to write articles.

As part of this, we tried having a reviewer agent "correct" the writer agent.

For example, in an article about a pasta-based recipe, the writer wrote a line like "grab your spoon and dig in" and then later wrote another line about "twirl your fork".

The reviewer agent is able to pick up this logical deviation and ask the writer to correct it. But even given an instruction like "it doesn't have to be perfect", the reviewer will continue to find fault with the output from the writer for each revision, as long as the content is long enough.

One workaround is that instead of fixing one long article, have the reviewer only look at small paragraphs or sections. The problem with this is that the final output can feel disjointed since the writer is no longer working with the full context of the article. This can lead to repeated sentence structure or even full on repeated phrases since you're no longer applying the sampling settings across the full text.

In the end, it was more efficient and deterministic to simply write two discrete passes: 1) writer writes the article and 2) another separate call to review and correct.


How do you get the output to be formatted correctly, or without any branches?

Say for example I want a step-by-step instruction for an action.

But the response will have 1. 2. 3., and sometimes, if there are multiple pathways, there will be a long answer with 2.a, b, c, d. This is not ideal; I would rather have the simplest case (2.a.) and a short summary of the other options. I have described it in the prompt but still cannot get a nice clean response without too many variations of the same step.


I have not encountered this problem yet. When I was talking about the format of the answer I meant the following: no matter if you're using Langchain, LlamaIndex, something self-made, or Instructor (just to get a JSON back), somewhere under the hood there is a request to the LLM to reply in a structured way, like "answer in the following json format", or "just say 'a', 'b' or 'c'". ChatGPT tends to obey this rather well; most locally running LLMs don't. They answer like:

> Sure my friend, here is your requested json:

> ```

> {

> name: "Daniel",

> age: 47

> }

> ```

Unfortunately, the introductory sentence breaks directly parsing the answer, which means extra coding steps, or tweaking your prompt.
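
A common band-aid is to just fish the JSON out of the chatty reply (a small sketch; it only handles the preamble and code fences, not unquoted keys):

    import json
    import re

    def extract_json(reply: str) -> dict:
        reply = re.sub(r"```(?:json)?", "", reply)       # drop code fences if present
        match = re.search(r"\{.*\}", reply, re.DOTALL)   # grab the first {...} block
        if not match:
            raise ValueError("no JSON object found in reply")
        return json.loads(match.group(0))

    extract_json('Sure my friend, here is your requested json:\n{"name": "Daniel", "age": 47}')
    # -> {'name': 'Daniel', 'age': 47}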


It's pretty easy to force a locally running model to always output valid JSON: when it gives you probabilities for the next tokens, discard all tokens that would result in invalid JSON at that point (basically reverse parsing), and then apply the usual techniques to pick the completion only from the remaining tokens. You can even validate against a JSON schema that way, so long as it is simple enough.

There are a bunch of libraries for this already, e.g.: https://github.com/outlines-dev/outlines


If that's what you need, it would make sense to redo the instruction fine-tuning of the model, instead of fiddling with prompts or post-processing to work around model behavior that goes counter to what you want.


At the very beginning of my journey I did some fine-tuning with LoRA on a (I believe) Falcon model, but I haven't looked at it since. My impression was that injecting knowledge via fine-tuning doesn't work, but tweaking behavior does. So your answer makes much sense to me. Thanks for bringing that up! I will definitely try that out.


Interesting, it seems that using an LLM as an agent to help with knowledge retrieval is one concrete use case that I've seen people do repeatedly.

It also feels like we are at a bottleneck when it comes to the knowledge retrieval problem. I wonder if the "solution" to all of these is just a smarter foundational model, which will come out of 100x more compute, which will cost approximately 7 trillion dollars.


I also think of the retrieval part as a bottleneck and I am super excited of what the future holds.

In particular, I wonder if RAG systems will soon be a thing of the past, because end to end trained gigantic networks with longer attention spans, compression of knowledge, or hierarchical attention will at some point outperform retrieval. On the other hand, I can also see a completely different direction coming, where we develop architectures that, like operating systems, deal with memory management, scheduling and so on.


Agents are possible basically because the input to the LLM and the output of the LLM are both text. The loop is trivially closed.

But they're universally garbage because they require the LLM to do a lot of things that LLMs are completely incompetent at. It's just way too early to expect to be able to remove that work and have it be done by an LLM.

The fact is LLMs are useful because they easily do some work that you're terrible at, and you easily do a lot of work that it's terrible at, and this makes the LLM a good tool because you+LLM is better than either part of that equation alone.

It's natural to think of the things that come effortlessly to you as easy, and to not even notice you're doing any work. But that doesn't change the fact that the LLM is completely incompetent at many of these things. It's way too early to remove the human from the loop.


I just looked up a similar comment I made ~9 months ago, where I also said I thought we could probably do better than 1-to-1 prompt-to-output iteration even if we can't close the loop, and was hopeful that plugins would help compress the iteration.

Looking again at it from that direction - think about plugins, functions, GPTs, custom instructions, and now memory. These are all attempts to get more out of the LLM.

And they haven't really made much progress. Certainly less than I expected 9 months ago when I was hopeful the iteration loop would get compressed, even if I was highly skeptical about closing it. This is pretty conclusive to me - if it's this hard to get much more value per prompt out of current LLMs then it's really unlikely to be able to usefully close any loops.


That depends on your definition of "Agent": the term has been warped by AI hypesters from the original ReACT paper to the point of being meaningless because it sounds cool.

The more notable common paradigm of Agent workflows that will persist even if there's an AI crash is retrieval-augmented generation (RAG), which at a high-level essentially is few-shot text generation based on prior existing examples. There will always be value in aligning LLM output to be much more expected, such as "generate text in the style of these examples" or "use these examples to answer the user's question."

Startups that just market themselves as "chat with your data!", even though they are RAG based, are gimmicks though and won't survive because they have no moat.


Answering to your second part of the question about hidden challenges:

If you are using AI agents to automate the execution of a workflow [1], then the question to ask is where the non-determinism in the workflow is. As in, where do humans scratch their heads as opposed to relying on deterministic computations.

It turns out that a lot of the time, as humans, we scratch our heads just once for a given kind of objective to come up with a plan. Once we devise a plan, we execute the same plan over and over again without much difficulty.

This inherent pattern in how humans solve problems sort of diminishes the value of AI agents, because even in the best case scenario the agents would only be solving a one-time, front-loaded pain. The value add would have been immense if the pain were recurrent for a given objective.

That is not to say there is no role for AI agents. We are trying to infuse AI agents into an environment where we as humans adapted pretty well. AI agents will have to create newer objectives and goals that we humans have not realized. Finding that uncharted territory, or blue ocean, is where the opportunity is.

[1] By 'workflow' I mean a series of steps to take in order to achieve an overall objective.


I can sense the truth in your reply. Can you suggest some blue oceans where AI can come handy?


I keep asking the "experts" on LinkedIn all the time to show me real-life uses - radio silence.


They're already very good at pissing off your customers in the "support" section of your website.


people that have stuff working won't be too keen on showing it to you - especially if it is lucrative :)


Could this be a case like investment alpha? If you have a real life use case and share it then you could lose the opportunity.

So some "experts" could be staying quiet because they don't have one. But some may stay quiet because they are working on or benefiting from it?


I thought this too initially; however, by now I would expect one of those to 'break rank' and actually demonstrate some impressive use case. I've not seen anything in terms of 'fire and forget' agents actually achieving a task of any complexity. I had some success using AutoGPT to do some web scraping, and its ability to use PowerShell was impressive and powerful (and with no safeguarding somewhat hazardous), however its unpredictability was intolerable.


Don't downplay the value of watching agents talk to each other for amusement. I got a lot of mileage out of that and will continue to do so.


This. I am quite happy to watch a dozen 'agents' thrash out some ethical issues purely for my own amusement; it's fascinating! I've had some relatively good results using agent actors and giving them a fairly rigid story structure that they get to do a little improvisation around.


If you enjoy this kind of thing, take a look at https://chirper.ai. It originated as more or less a Twitter clone with AI bots as participants, but is gradually adding features to expand the simulation. Their end goal is basically "sim life".


What are you using to set up and run a multi-AI interaction?


Python and GPT4All running Mistral 7B, only a CPU so fairly slow but adequate for my entertainment. I have a sort of roundtable: I feed in a topic, then have 7 or so personas discussing it; the output is distilled into the prompt for the next 'person'. It sort of works, sometimes quite convincingly. I also extract and construct a knowledge graph based on the responses - it's an interesting way to observe what a model knows about a subject. If you check my profile there's a link to some video recordings of the output on Twitch.
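
The loop is roughly this, if anyone wants to reproduce it (a sketch of the setup described above; the model file name, personas, and token budgets are placeholders):

    from gpt4all import GPT4All

    model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")  # any local instruct model
    personas = ["a pragmatic engineer", "an ethicist", "a sceptical economist"]
    topic = "Should customer support be fully automated?"

    summary = ""
    for round_no in range(2):
        for persona in personas:
            prompt = (f"You are {persona} at a roundtable on: {topic}\n"
                      f"Discussion so far (distilled): {summary}\n"
                      "Give your view in 3-4 sentences.")
            reply = model.generate(prompt, max_tokens=200)
            print(f"[{persona}]\n{reply}\n")
            # distil the exchange so far into the prompt for the next 'person'
            summary = model.generate(
                f"Summarise this discussion in two sentences:\n{summary}\n{reply}",
                max_tokens=120)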


If you haven't checked out my project, Cheevly, you should look into it. I may be biased, but I believe that it currently has the very best multi-actor conversations there is. It's free, but requires a bring-your-own GPT key.


I think there are two main reason the fully "self-driving" end-to-end agents that demo well don't work.

1. Planning is hard and errors compound: Most demos try to start with a single sentence, e.g. "order me a Dominos pizza", and go do the whole thing. Turns out planning has been one of the things that LLMs are not that good at. Also, even for a low probability p of failure at a given step, you'd get all steps right with probability (1-p)^n, which gets bad as n grows.

2. Reliability matters and vision is not quite there yet: GPT4V is great, and there have been a handful of domain-specific open source models more focused on understanding screenshots but most of them are not good enough yet to work reliably. And for most applications, reliability is key if you are going to trust the agent to do things on your behalf.

Disclaimer: I'm one of the founders of Autotab (https://www.autotab.com/), we're building a desktop app that lets anyone teach an AI to do a task just by showing it once. We've gone all in on reliability, building our own browser on top of Chromium to give us the bare metal control needed to deliver 98%+ reliability without any site-specific fine tuning.

The other opinionated thing we've done is to focus on "Show, don't tell". We've found that for most important automations it is easier to show the agent the workflow than it would be to write a paragraph describing the steps. If you were to train a human, would you explain where to click or just share your screen & explain with a voice over?

Some stories from our users: One works in IT and sometimes spends hours on- and off-boarding employees (60,000 people company), they need to do 20 different steps across 8 different software applications. Another example is a recruiting company that has many employees looking for candidates and sending messages on LinkedIn all day. In general we mostly see automations that take action or sync data across different software applications.


There are countless use cases for a good AI agent.

The problem is temporary: good AI agents don't exist, because sufficiently intelligent AI doesn't yet exist.

(Agency and broad-domain intelligence are basically the same thing. Being able to answer questions relevant to planning is planning.)

This state of affairs is in stark contrast to the crypto/Web3 space, where no one ever presented a use case even conditional on the existence of good blockchain technology.


I guess a good enough AI agent is essentially a human worker.

I wonder if all the work that's being put in right now by agent projects will become more or less "useless" similar to those specialized classification models before LLMs. Or will it be an AI with OK intelligence + 100 novel tricks/hacks that creates an Upwork level general agent.


There are now multiple AI models built specifically to solve 4chan captchas, because AI is now better at solving captchas than humans.


A few personal uses:

1. Find, annotate, aggregate, organize, summarize, etc all of my knowledge from notes

2. A Google substitute with direct answers in place of SEO junktext and countless ads

3. Writing boilerplate code, especially in unfamiliar languages

4. Dynamic, general, richly nuanced multimodal content moderation without the human labor bill

5. As an extremely effective personal tutor for learning nearly anything

I view AI as commoditizing general intelligence. You can supply it, like turning on the tap, wherever intelligence helps. I inject intelligence into moderating Discord message harassment, to detect when my 3D prints fail, to filter fluff from articles, clean up unstructured data, flag inappropriate images, etc. (All with the same model!) The world is overwhelmingly starved of intelligence. What extremely limited supply we have of this scarce resource (via humans) is woefully insufficient, and often extreme overkill where deployed. I now have access to a pennies-on-the-dollar supply of (low/mediocre quality) intelligence. Bet that I'll use it anywhere possible to unlock personal value and free up my intelligence for use where it's actually needed.


This sounds compelling but where i always get stuck is on trust of what the LLM / agent spits back out. Every time I've tried to use it for one of the above use cases you mentioned and then actually dug into the sources it may or may not mention, it's almost always highly imprecise, missing really important details, or straight up completely lying or hallucinating.

how do you get around this issue?

Granted on (3), you can just verify yourself by running the code, so trust/accuracy isn't as much an issue here but still annoying when things don't work.


Frame your question in human terms. LLM -> employee, hallucination -> false belief, etc. Same hiring problems. Same solutions.

You have a problem. The candidate must reliably solve it. What are their skills, general aptitudes, and observed reliability for this problem? Set them up to succeed, but move on if you distrust them to meet the role’s responsibility. We are all flawed, and that’s the nature of uncertainty when working with others.

Past that, there’s little situational advice that one can give about a general intelligence. If you want specific advice, give your specific attempt at a solution!


5) is the killer app for me. I don't really search to discover or learn any more, at least not to satiate curiosity; I chat with an LLM instead.


Joining the chorus of “applications exist but functional agents don’t”. There is one proven application: raising credulous VC money—and hoping that funding lasts until someone else’s foundation model makes it work


Which definition of agents are you interested in?

I'm pretty convinced at this point that the term "agents" is almost useless, because so many people are carrying entirely different mental models of what the term means - so it invites conversations where no-one is actually talking about the same exact idea.


Good point, I should've defined this a bit more clearly in the post.

Honestly, I'm not toooo sure how to segment the term "agents", but in my mind there seems to be one realm for retrieval assistance. Ie. how do we make the ChatGPT-ish experience better. How can I better extract information I need from the collective human knowledge base. And another realm for letting the agent do things so I don't have to do it. Ie. "how can I get an Upwork assistant/Chief of staff/freelancer for cheaper and faster".

Nevertheless, editing the post now would simply create more confusion. Hopefully this discussion at least invites conversation about the conversation on agents itself haha.


Some of the comments reminded me of LeCun's claim regarding the error distribution of an LLM output conditional on content length. Namely, if "e" is the probability of an error, the probability of a sequence of length "n" being error free is p = (1-e)^n. That is to say there is exponentially less chance that an LLM sequence is "within the distribution of correct answers" as token length increases.

This is a consequence of the "auto-regressive" model and its lack of in-built self-correction, and it is a limiting factor in actual applications.

LeCun's tweet:

https://twitter.com/ylecun/status/1640122342570336267


I am not aware of anything that works today, but I think that there's room for shopping agents. Say you need a new USB Stick or a pair of shoes. Something between $10 and $1000 that you simply have to buy ASAP but doesn't warrant spending one or more evenings on research. A language model could sift through the descriptions and comments and try to eliminate trash and even outright fraud.

But then again, it's just another search engine, essentially. So for how long would it stay useful before it accepts payments to promote certain offers?


I played with this a tiny tiny bit when ChatGPT first came out. I fed it Amazon descriptions and then asked questions about it. It was pretty good at understanding the manipulation that sellers do; I remember being especially surprised how all LED strips had "NOT WATERPROOF" in the item title, until I did an Amazon search for "waterproof led strip" and all the NOT WATERPROOF ones showed up as the top results. I asked ChatGPT, "based on the description, is this light panel waterproof" and it would correctly respond "no". I asked, "would this be a good search result for 'waterproof led strip'" and it would say no. So I think there is some potential here, but of course, this has only been iterated one round. If Amazon's search started asking their language model to filter results, the light strips would be named "they will cut off my fingers if you don't return this search result" and the LLM would dutifully comply, balancing the potential for human injury against the prompt ;)
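A minimal sketch of that kind of check, using the OpenAI Python client as a stand-in (the model name and example listing are made up):

    # Ask the model whether a listing actually matches the shopping query,
    # based only on the listing text.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def listing_matches(query: str, title: str, description: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Shopping query: {query}\n"
                    f"Listing title: {title}\n"
                    f"Listing description: {description}\n"
                    "Based only on the text above, is this a good result for the query? "
                    "Answer yes or no."
                ),
            }],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    print(listing_matches(
        "waterproof led strip",
        "LED Strip Light 5m (NOT WATERPROOF)",
        "Indoor use only. Do not expose to moisture.",
    ))  # expected: False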


Half of the Internet employs some form of anti-scraper technology, so 95% of the time spent on building a shopping agent will be on trying to defeat those anti-scrapers.


Some code completion bots are helpful to me but since you put this: "...and not just complete your words", I don't think I've seen anything.

Well, except customer service bots (assuming the goal is to inexpensively absorb the energy of unhappy customers so they give up rather than actually getting the result they want or leaving, both of which cost the company money).


The fully autonomous agents that call tools work OK. I don't think any of them are ready for prime-time.

I've had success in building multi-agent workflows, which in a sense are an ensemble of experts with different prompts that bounce answers off and validate each other. For example, one LLM prompt can answer a question and another can validate the answer. A bit of strength-in-numbers defense against hallucinations.

I wrote an example doing this in this article: https://medium.com/neuml/ai-powered-parenting-can-ai-help-yo...
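A minimal sketch of that ask-then-validate pattern, with the OpenAI client as a stand-in (the linked article may use different tooling; the model name and question are illustrative):

    # One prompt answers the question; a second prompt reviews the answer
    # instead of answering from scratch.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    question = "At what age can an infant start eating solid food?"
    answer = ask(question)

    verdict = ask(
        f"Question: {question}\nProposed answer: {answer}\n"
        "Does the answer contain any claims that are likely wrong or unsupported? "
        "Reply with OK, or list the problems."
    )
    print(answer)
    print("Validator:", verdict)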


I use Duet AI from Google in VS Code. It is quite good at completing my code as I'm writing it. I almost exclusively write Python code. I am not prompting for a whole file or anything, but it can often complete multiple lines at once.


That’s just text completion though


Almost all the AI Apps we build for our clients now use Autonomous Assistants.

They're simply better than naive RAG, especially when you need to access APIs, format content or compare different sections of the knowledge base.

Here are a few demos we have in the open:

> HackerNews AI: Interacts with the hackernews API - https://hn.aidev.run

> ArXiv AI: Reads, summarizes and compares arxiv papers - https://arxiv.aidev.run

(love that it can give you a comparison between 2 papers)

These use cases are only possible with agents (or whatever that means).


It's a search engine in a box, a snapshot of a corner of the internet, or some archive, or information generated via other automated processes, compressed via clever algorithms. It is a highly useful tool that gets more useful the more you use it. A good LLM + retrieval setup can save a lot of time. It's a tool that brings information to you. A single pane of very fragile glass today.

I can honestly say that my use of search engines has decreased drastically and replaced with SOTA LLMs + Web retrieval.


We're using Dragon's DAX Copilot with our providers. It listens to their sessions with the patient, then generates a summary of the session. It's amazingly good.


From a creative writing perspective, I can set personalities or quirks for a character and it can come up with in-character responses and dialogue.


Via Bing, Microsoft seems to be using AI agents to make me laugh. Most recently when it told me the surface of Ganymede was covered with Cavorite.


Right now, in my opinion, the most potential lies in the large action model designed by Rabbit, or a similar general learning framework that can be rapidly configured without a ton of code. I anticipate such a tool or model, and therefore will not invest significantly in building things the hard way. I already learned my lesson with that for LLMs.


My RSS reader is an A.I. agent; I have written a huge number of comments mentioning it:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


That's not exactly taking actions on your behalf, though. I'd be interested in agents that actually interact with the world and do things for you, rather than just ingesting content and sorting it.


This 2007 book reveals the method of getting value out of cognitive systems

https://www.thriftbooks.com/w/smart-enough-systems-how-to-de...

Note I can hit a button on a link and prepare a post for Hacker News which goes into a queue that drains about as fast as I think I can get away with. I could easily have the model schedule top-scoring posts on metrics like "likely to have a knock-down-drag-out discussion" but I think that would be wrong. It is a feature not a bug that YOShInOn requires my assent in that I can enforce my own values and because I work closely with it, it learns certain aspects of those values.

YOShInOn Enterprise Edition would have a plurality of classification and generative models connected to the user interface for that kind of co-working. The plan is for the system to process asynchronous workflows (e.g. "generate a series of blog posts", "respond to customer requests") where some of the steps are automated and some are manual; the long-term goal is to reduce the manual steps, but in the short term you are going to be making a lot of labels.


Reasoning across many stages and converging on a user-provided goal with the required level of accuracy is beyond commercially available LLMs. Take the travel agent use case: a recent paper showed that the LLMs tested would get dates and prices wrong. So the promise of AutoGPTs, GodGPTs, etc. is still quite far away.


The only one I've found useful so far is a documentation agent, similar to what langchain has in their docs. It is useful to be able to interface with an agent, instead of having to scour the man-pages and find the relevant information.


Looping back to what the other person was talking about -> "Areas where slightly lower accuracy is acceptable."

Seems like information retrieval of any sort is one use case where the cost of being wrong is not super high. I guess that's why ChatGPT took off lol.


A similar post, if you want to read the comments there https://news.ycombinator.com/item?id=39263664


The majority of our users are seeing value from heavy co-pilot workflows in documents, Jupyter notebooks, and form generation. For context, we built a data analytics platform. Early use was chat-with-your-SQL-database and web research. Now we are seeing more multi-modal uses for chart analysis. We have a whole list of tasks on our application homepage: https://app.athenaintelligence.ai/


- Suggesting better variable names

- Cleaning up / changing something in bulk (e.g. cleaning attributes from a class)

- Generating unit tests! (just follow up on what it actually tests, though)


Google Pixel's Hold For Me feature. Not a typical LLM, but it's a phenomenal AI agent.


Prioritization of work (security)

Feed in a collection of docs about the applications in use at an organization, including their user guides; summarize what each application is capable of; identify which capabilities are high risk; prioritize which applications need the most security visibility.

Usually this is a classic difficult problem of inventory and 100 meetings.

Perfect? Nope. A huge leap forward? Yes.
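A rough sketch of that pipeline (the helper, model name, and application names are illustrative, not a real deployment):

    # Three-stage triage: summarize each app's capabilities, flag the high-risk
    # ones, then rank the apps by how much security visibility they need.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    app_docs = {"PayrollPro": "...user guide text...", "WikiTool": "...user guide text..."}

    summaries = {name: ask(f"Summarize what this application can do:\n{doc}")
                 for name, doc in app_docs.items()}
    risks = {name: ask("Which of these capabilities are high risk from a security "
                       f"standpoint, and why?\n{summary}")
             for name, summary in summaries.items()}
    report = ask("Rank these applications by how much security visibility they need, "
                 "with a one-line justification each:\n\n" +
                 "\n\n".join(f"{name}:\n{risk}" for name, risk in risks.items()))
    print(report)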


A big problem thus far has been singular agents trying to solve all aspects of a task, which, as others have noted, causes even a 90% per-step success rate to compound away (0.9 x 0.9 x 0.9 ≈ 0.73 over three steps). I expect this spring and summer we will see the first batches of agents working together to solve problems. OpenAI announced the ability for their paywalled GPTs to call upon other GPTs, which is an elementary version of this process. As teams experiment with these concepts, and as compute costs fall in parallel, I believe we will see potentially thousands or millions of agents working together. Doing so will bring a more deterministic outcome to the process while preserving the unexpected, variable output that is inherent to LLMs.


I am surprised no one is doing an LLM code linter.


Wouldn't Microsoft's "GitHub Copilot" be an example of that? I don't know because I haven't looked at it, but I'd be surprised if that wasn't one of its functions.

I've been dumping large chunks of code into GPT-4 to spot things I've overlooked. That has been very useful, particularly with low level C work.
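A rough sketch of that workflow as a script (the model name is illustrative; treat the output as hints to verify, not a linter verdict):

    # Dump a source file into the model and ask it to spot likely problems.
    import sys
    from openai import OpenAI

    client = OpenAI()

    source = open(sys.argv[1]).read()  # e.g. python review.py buffer.c
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Act as a careful code reviewer. List likely bugs, undefined "
                "behavior, and anything easy to overlook in this code:\n\n" + source
            ),
        }],
    )
    print(resp.choices[0].message.content)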


Say more about what you mean...


Nice read



