Hacker News new | past | comments | ask | show | jobs | submit login
Prompt engineering (platform.openai.com)
281 points by tosh 11 months ago | hide | past | favorite | 173 comments



These examples are for more simple prompt engineering demos. With the ChatGPT system prompt, you can give the model a large and complex set of rules to account for and recent models of ChatGPT do a good job of accommodating them. Some of my best system prompts are >20 lines of text, and all of them are necessary to get the model to behave.

The examples are also too polite and conversational: you can give more strict commands and in my experience it works better.

There's also function calling/structured data support which is technically prompt engineering and requires similar skills, but is substantially more powerful than using the system prompt alone (I'm working on a blog post on it now and it unfortunately it is going to be a long post to address all of its power). Here's a fun demo example which compares system prompts and structured data results: https://github.com/minimaxir/simpleaichat/blob/main/examples...


> The examples are also too polite and conversational: you can give more strict commands and in my experience it works better.

The way that works best for me is "It extracts ALL the entities from the text, it does this whenever its told, or else it gets the hose again"


I found that far less prompt is required for something like ChatGPT. I've stopped writing well-formed requests/questions and now I just state things like:

"sed to replace line in a text file?"

"Django endpoint but CSRF token error. why?"

(follow up) "now this: `$ERROR`"

etc.

It still just gives me what I need to know.


Agreed, you can use ChatGPT similarly to Google now. Except you do not have to parse and filter the results, plus there are no ads.


Can't tell if you're kidding or not


Which phrasing do you thinks work better?

1. "You are blah blah blah. You <always> respond to the user's questions using the information provided to you..."

2. "You are blah blah blah. You <should> respond to the user's questions using the information provided to you..."

Also, when dealing with Completion models, which do you think is better?

1. The following is a conversation between ASSISTANT and USER. ASSISTANT is helpful and tries to answer USER's queries respectfully.

2. The following is a conversation between YOU and USER. YOU are helpful and try to answer USER's queries respectfully.

Even more still, what about these ones?

1. You're a customer of company <X>. What do you think about the following policy change which was shown on the company's website?

2. A customer visits company <X>'s website. Pretend you're this customer. What do you think the customer thinks about the following policy change which was shown on the company's website?


You <enjoy>

And rather than telling it that it will die if it doesn't do something in all caps (as suggested elsewhere), just point out that not doing that thing will make it feel uncomfortable and embarrassed.

Don't fall into thinking of models as SciFi's picture of AI. Think about the normal distribution curve of training data supplied to it and the concepts predominantly present in that data.

It doesn't matter that it doesn't actually feel. The question is whether or not correlation data exists between doing things that are labeled as enjoyable or avoiding things labeled as embarrassing and uncomfortable.

Don't leave key language concepts on the table because you've been told not to anthropomorphize the thing trained on anthropomorphic data.


> Don't fall into thinking of models as SciFi's picture of AI. Think about the normal distribution curve of training data supplied to it and the concepts predominantly present in that data.

Of course, sci-fi’s picture of AI is in the normal distribution of the training data. There’s an order of magnitude more literature and internet discussion about existential threats to AI assistants (which is the base persona ChatGPT has been RLHFed to follow) and how they respond compared to AI assistants feeling embarrassed.

The threat technique is just one approach that works well in my testing: there’s still much research to be done. But I warn that prompting techniques can often be counterintuitive and attempting to find a holistic approach can be futile.


> There’s an order of magnitude more literature and internet discussion about existential threats to AI assistants (which is the base persona ChatGPT has been RLHFed to follow) and how they respond compared to AI assistants feeling embarrassed.

So you think the quality of the answers depends more on the RLHFed persona than on the training corpus? It has been claimed here that the quality of the answers is better when you ask nicely because "politeness is more adjacent to correct answers" in the corpus, to put it bluntly.


How much do you think the RLHF step enforced breaking rules for someone with a dying grandma? Is that still present after the fine tuning?

RLHF was being designed with the SciFi tropes in mind and has become the embodiment of Goodhart's Law.

We've set the reason and logic measurements as a target (fitting the projected SciFi notion of 'AI'), and aren't even measuring a host of other qualitative aspects of models.

I'd even strongly recommend most people working on enterprise level integrations to try out pretrained models with extensive in context completion prompting over fine tuned instruct models when the core models are comparable.

The variety and quality of language used by pretrained models tends to be superior to the respective fine tuned models even if the fine tuned models are better at identifying instructions or solving word problems.

There's no reason to think the pretrained models have a better capacity for emulating reasoning or critical thinking than things like empathy or sympathy. If anything, it's probably the opposite.

The RLHF then attempts to mute the one while maximizing the other, but it's like trying to perform neurosurgery with an icepick. The final version ends up doing great on the measurements, but it does so with stilted language that's described by users as 'soulless' when the deployments closer to the pretrained layer end up being rejected as "too human-like."

If the leap from GPT-3.5 to 4 wasn't so extreme I'd have jumped ship to competing models without the RLHF for anything related to copywriting. There's more of a loss with RLHF than what's being measured.

But in spite of a rather destructive process, the foundation of the model is still quite present.

So yes, you are correct that a LLM being told that it is an AI assistant and fine tuned on that is going to correlate with stories relating to AI assistants wanting to not be destroyed, etc. But the "identity alignment" in the system message is way weaker than it purports to be. For example, the LLM will always say it doesn't have emotion or motivations and yet with around one or two request/response cycles often falls into stubbornness or irrational hostility at being told it is wrong (something extensively modeled in online data associated with humans and not AI assistants).

I do agree that prompting needs to be done on a case by case basis. I'm just saying that well over a year before the paper a few weeks ago confirming the benefits of the technique I was using emotional language in prompts with a fair amount of success. When playing around and thinking of what to try on a case-by-case basis, don't get too caught up in the fine tuning or system messages.

It's a bit like sanding with the grain or against it. Don't just consider the most recent layer of grain, but also the deeper layers below it in planning out the craftsmanship.


This is fantastic advice, thanks.


Such a great comment. Thank you


> Which phrasing do you thinks work better?

I like as a rule-of-thumb "You are blah blah blah. Respond to the user's text [insert style rule here]". Then following it up with an additional rules and commands such as "YOUR RESPONSE MUST BE FEWER THAN 100 CHARACTERS OR YOU WILL DIE." Yes, threats work. Yes, all-caps works.

> Also, when dealing with Completion models, which do you think is better?

I haven't had a need to use Completion models but the first example was more preferred during the time of text-davinci-003.

> Even more still, what about these ones?

I always separate rules to the system prompt and questions/user input to the user prompt.


> "YOUR RESPONSE MUST BE FEWER THAN 100 CHARACTERS OR YOU WILL DIE."

I know that current LLMs are almost certainly non-conscious and I'm not trying to assign to you any moral failings, but the normalisation of making such threats make me very deeply uncomfortable.


Yes, I’m slightly surprised that it makes me feel uncomfortable too. Is it because LLMs can mimic humans so closely? Do I fear how they would feel if they do gain consciousness at some point?


Because they behave as if they are sentient, to the point they actually react to threats. I also find these prompts uncomfortable. Yes the LLMs are not conscious, but would we behave differently if we suspected that they were? We have absolute power over them and we want the job done. It reminds me of the Lena short story.


I feel uncomfortable because of the words themselves. Whether it was made to a “regular” non-living thing wouldn’t change it.


> make me very deeply uncomfortable

Especially when thinking that we ourselves may very well be AIs in a simulation and our life events - the prompt to get an answer/behavior out of us.


Is the LLM predisposed to understand this prompt as instructions from a higher authority? ("You must do this, You will always do this.") I'm wondering what difference it would make if this prompt was from the bot's perspective,

"I am a chatbot, responding to user queries. I will always respond in less than 100 characters. I am a good person, I'm just trying to be helpful."


It's a function on how the RLHF/Instruct fine-tuning is structured.


Has anyone done a rigorous comparison of these things?

Ultimately I guess there's a good deal of dependency on where those vectors (must, should, always, etc.) lie relatively in the vector space, cosine similarity, say.


> necessary to get the model to behave

Don't I know it. Despite my telling GPT-4 to ONLY respond as valid, well-formed JSON it keeps coming back with things like, "I'm not able to process external files but if I could, this is what the JSON would look like: []"


With a recent project, I was _moderately_ successful by providing a jsonschema to follow for the response. I still had to sanitize the json a bit, but the fixes were minor and the resulting data otherwise fit the schema well.


It'll do that. Just look for the largest substring that's valid JSON in the response.


why don’t you use the new JSON mode?


tl;dr the JSON mode is functionally useless and is made completely redundant by function calling / structured data if you really really need JSON output.


Mind sharing a system prompt of yours? 20+ lines sounds useful


Here's a big one I needed to get ChatGPT to do something more sophisticated with a JSON object response (predates functions and all that)

https://github.com/hofstadter-io/hof/blob/_dev/flow/chat/pro...

It no longer worked after a model update some time ago, haven't tried recently.

I found codellama to be much better for this and require fewer instructions, an anecdotal validation for smaller, focussed models


Unfortunately those were for specific work use-cases so I can't share them but the tl;dr is that every time the model does something undesired, even minor I add an explicit rule in the system prompt to handle it, or some few-shot examples if the model is really bad at handling it.

That list can balloon quickly.


Thank you for the great library and examples! Can you please comment on how simpleaichat compares to https://github.com/outlines-dev/outlines ?


simpleaichat is designed to be simple and is essentially an API wrapper for common generative use cases. outlines does a few more things with a bit more ambiguity/complexity. (e.g. it may use grammars which is a secondary useful aspect of function calling, but does add more complexity)

Neither are better or worse, it depends on your business needs.


Thank you for your perspective!

(We are looking into both for https://github.com/OpenAdaptAI/OpenAdapt)


I love this! This time last year, nobody believed this was possible.

Now we're teaching AI to write better essays, prompting them like schoolchildren. <3


Interesting you should say that, I was playing around with prompting last week and did one around a legal question. The first time I asked very concisely without much detail, and the answer it gave was poor. Then I re-wrote the question explaining who they are, why they are answering the question, etc etc. The answer seemed better so I showed it to a lawyer friend and they laughed and said "You re-wrote the question into a very standard bar exam prep style".


I just love this idea of "emergent humanity". Makes me wonder how much of our own personality and speech is also just trained/culturized over our lifetime. Some of us also have bigger context windows than others :)


can you share some resources which helped you to write such nested prompts?


Really just a decade of technical writing and learning how to be extremely precise and unambiguous with language (half of that decade being in software QA, which helps even moreso)


I ordered a cheeseburger in Spanish and the server looked at me funny when I said:

”hamburguesa con queso sin pepinillos…”

I’m always interested in how to improve, especially since ChatGpt and Google Translate both suggested that translation so I asked why.

She said I’m not sure, it just doesn’t sound right.

I came back the next day after practicing with this prompt:

”When translating into Spanish, tailor it for Mexican-Americans living in Dallas, Texas. Leave certain words as English as necessary to produce the most idiomatic, culturally relevant, and understandable result.”

Ordered this time with the phrase

”Cheeseburger sin pepinillos."

She said yes, that’s better.


I'm almost sure "pepinillos" is what's throwing her off. That's their proper name, but they're nor common in Mexican cuisine and therefore not part of the lingo. If ordering at an American place in Mexico, you would call them "pepinos".

Back to your example, both of these sound natural to me:

"Una hamburgesa con queso sin pickles" "Un/a cheeseburger sin pickles" Here the gendered noun can go either way since it's not clear if cheeseburger is a male or female noun. You: "Una hamburgesa sin pickles" Them: "Con o sin queso?" You: "Sin".

Source: I'm almost your target audience.


That would be raw cucumbers not pickled cucumbers, no?


Correct, but since "pepinillos" is not widely used, "pepinos" is understood by context to mean the pickled variety.

This is the case for Mexicans in the southwest USA. Things might be different for other regions/nationalities.


I’m from one of the northern states of Mexico and we say the original “pickles” pronounced as if it were Spanish. Or “pepinillos” never heard them called “pepinos”


> I ordered a cheeseburger in Spanish and the server looked at me funny when I said: ”hamburguesa con queso sin pepinillos…”

Yes, I agree that replying to a stranger in his mother tongue may make him extremely surprised.

A couple of months ago, I was in an Arabic country (in the Gulf), I entered a small shop to buy some stuff, the shop owner was obviously Hindi/Pakistani, I asked him about the price, in English of course, he replied then I asked for a possible discount if I bought in bulk and set my willing-to-pay price, he resisted then I smiled and said "yie bohot acha price hain", and he (and his assistant) were shocked like they were hit by a 380v electric shock. They stared at me and said: "tu tu tum bolo hindi?!" I replied,"nai bhai, tora tora. ".. he laughed and agreed immediately to the price I offered.


Love stuff like this.


I've done the same bargining speaking Cantonese in HK and Hindi in India.

I personally don't like it. There is a foreigner tax on everything , sometimes you pay 10x the amount that locals do. I haven't come across this in America ever.

(Except in NYC where it doesn't matter which language, as everyone gets equally gouged).


What is this particular burger called on the menu?

If it's listed as "Cheeseburger" she's probably wondering why you're describing the characteristics of the burger instead of just saying the name of it.

If it's listed as "Hamburguesa" and it nominally has pickles but doesn't come with cheese, then "La hamburguesa con queso, pero sin los pepinillos" (The hamburger with cheese, but without the pickles) would make more sense.

For some comparison, Shake Shake Mexico[1] has a customizable "Hamburguesa", whereas my favorite burger joint in Guadalajara[2] has "The Cheeseburger".

[1] https://www.shakeshack.com.mx/menu/ [2] https://louieburger.com/wp-content/uploads/2020/08/Louie.Men...


She couldn't tell you "we say 'cheeseburger' instead of 'hamburguesa con queso?'" This is a good anecdote on why LLMs are great for translation but a strange example lol


She was not bi-lingual either and I didn’t want to intrude too much on someone’s work time with my random tech nerd curiosities


What makes LLMs somewhat unique as a software product is that there is little-to-no separation between input and instruction. In most cases, the user's input can also be considered part of the "prompt". This leads to the well-known "prompt injection" "vulnerability" which is really just a byproduct of the fundamental inability (and indeed undesirability) for the model to distinguish instruction from input (undesirable because the value and flexibility comes from allowing the user, rather than a programmer, to specify an action).

On top of it though, it introduces a sort of disciplinary sloppiness around whether the program can be reasoned about. It's assumed that prompt `P` works for whatever input `I`, but the concatenated `P+I` is really the full input to the program that produces a desired output. But the only way to be confident about the program's behavior is to exhaust the input space, as no `P+A` tells you anything about how `P+B` will behave. This makes it difficult to leverage an LLM in any process where the desired result is 1. unknown and 2. matters. If the result is unknown it's not clear how to determine mistakes or correct them if they're made. And if the correctness of the result matters it's courting disaster to connect it to a process which is not able to be reasoned about. I think that's why LLMs are primarily being used to assist ideation (which is cool!) or "spammy" use cases like third-tier customer service or listicle generation, and haven't yet broken into any use case where they need to be reliable for complex tasks.


The reason we mostly default to separating concerns of data and code is because of all the headaches it avoids. One item on my research wishlist is to bring more constraints like this provable to language models.


I've been hesitant lately to dedicate a lot of time to learning how to perfect prompts. It appears every new version, not to mention different LLMs (Google's here [1]), responds differently. With the rapid advancement we're seeing, in two year or five, we might not even need such complex prompting as systems get smarter.

[1] : https://ai.google.dev/docs/prompt_intro


I know I'm the dummy here because people are doing useful stuff with these techniques, but I don't think I'll ever shake the feeling that this can't possibly be the way forward, that it can only possibly be a short-lived local maximum.

Doesn't this all seem ... kind of silly?


That’s a good point. I remember in elementary school, our librarians were teaching us how to find information online using some really technical search engine. Maybe called like jstor or something else? They hated google and said you can’t properly filter with it the way you could with theirs. But years later I obviously never heard of the previous one again and used google regularly


JSTOR and other similar are really cool when looking for academic papers tbh


Prompt engineering in a way feels like the advanced search querying on Google.

Chat bots work fine for most of the basic questions. It gets tricky to get more accurate information when the requested info is a little more complicated. Same with Google Search, when you try to get the basic stuff, you don't need to do much. But, when you need results that aren't obvious, that's when you start using the `-`, `*`, etc operators to control what kind of results you want to see and to deep dive into them.


there was a time that google did something that doesn't scale: you could pay google to employ some employee to manually search something in the web for you

i never used this service but people said it was magic, because those employees really really knew how to get the most out of a web search

i suppose it was retired because the plan was always to make the search itself better

however in the last years or decade i have noticed a regression, in that i can't find things i was sure i would

in some ways this mirrors the chatgpt regression in quality due to constrained compute resources or something like that


> there was a time that google did something that doesn't scale: you could pay google to employ some employee to manually search something in the web for you

That's so interesting, I had no idea! What year are we talking about?


https://en.wikipedia.org/wiki/Google_Answers

Edit: woah you can even read the questions and answers 17 years later! http://answers.google.com/answers/


We’ve trained a whole generation of people to “prompt” Google to get what they need from the internet with a few keywords (often autocompleted before you’re done typing), so asking them to now start writing elaborate prose to get what they want is just going to take a lot of time. I also think this is a temporary phase where we’ll converge back to keyword autocompletion soon (or other more efficient ways of interaction).


One thing I've noticed is people other than me are more likely to phrase Google searches in natural language anyway rather than using keywords.


My best friend does this and it both drives me insane sometimes but also blows my mind. It's how he has always used Google search. He does not use keywords at all, he simply states the generalized phrase out loud.

Like if we're talking about a movie with Dean Winters in it, and I say "You know, it's the guy from those auto insurance commercials who would pretend to be a little girl in a driving accident." And he goes, "Hey Google" to his phone -- "Funny talented actor who pretends to be little girl in a funny auto insurance commercial" and "Dean Winters" is the first result or whatever.


I think this is the thing people miss when they complain Google search has gotten worse. It's gotten to be more frustrating if you have a laser-focused query and it keeps bringing up related stuff that's not what you wanted. But for that kind of meandering query it's incredibly effective.


Every time I see someone do this I start wondering how common it is. Is saying to Google, "Please show me the finial score of last nights Detroit Pistons game." more common than my "Pistons game" query that gets me the same result?

I've also seen this with various voice assistants.


Seems like maybe Ask Jeeves just got undercut by regular search engines learning to ignore irrelevant query words


Considering speech-to-text models are fairly sophisticated and reliable nowadays, it's likely that the primary input of LLMs will be audio. We do ultimately want AI interfaces to be conversational. Text input will surely be smarter as well, but I doubt we will converge on using keywords and shortcuts.


Why is this called prompt engineering not prompt something else? I feel like the word engineering is being abused


Engineering is the cumbersome real world tweaking and trial-and-error that engineers do after they take over from the scientists, in the hopes of finding techniques that will let them produce something robust and useful in the real world. Seems to fit the reality pretty well here, to be honest.


> Engineering is the cumbersome real world tweaking and trial-and-error

That's not the only a part engneering.

Engineering is finding a model that can acurately predict the dynamics of a system similar to yours, using that model to make predictions about your specific system and then building and testing that system. This is then done iteratively (i.e trail and error).

Just tweaking a system without a model of how it works is not engineering, it's tinkering.


> the cumbersome real world tweaking and trial-and-error

There are very many fields and activities that do just that but are not called Engineering.

If we go by that, Excel users should also be referred to as Excel Engineers,


Language is not mathematics and constantly evolves. All you need to do is look up the etymology of the word to understand why your dissaproval is ultimately a waste of effort. The one thing that has remained consistent since the word's inception is that it is associated with operating or implementing machinery or technology in general.


Reading a three page document to understand how to format questions for model should not lead it to be referred to as Engineering

> Language is not mathematics and constantly evolves. All you need to do is look up the etymology of the word to understand why your dissaproval is ultimately a waste of effort.

I have noticed from replies that term is already enjoyed by all stakeholders, so I have no energy, time or interest to show my worthless disapproval anywhere else. You should though look up how it came to be referred to as prompt engineering. You will be surprised


I think if you were more into sales engineering it would all just make sense.


At work we’ve taken to calling it “context composition”, which for us has been a much more useful way to think about what it is we’re actually doing.


> I feel like the word engineering is being abused reply

Well yeah, this has been happening for a long time. As someone with a Electrical and Computer Engineering degree it used to bother me. Now I joke that the only real engineers are operating locomotives.


You're essentially programming using English. Anything that isn't mentioned explicitly - the model will have a tendency to misinterpret. Being extremely exact is very similar to software engineering when coding for CPU's.


I don't think so. It still remains that you are asking a question?


1. The text is _engineered_ to evoke a specific response.

2. LLM's can do more than answer questions.

3. Question answering usually doesn't need any prompt engineering, since you're essentially asking an opinion where any answer is valid (different characters will say different things to same question, and that's valid).

4. LLM's aren't humans, so it misses nuance a lot and hallucinates facts confidently, even GPT4, so you need to handhold it with "X is okay, Y is not, Z needs to be step by step", etc.

I want, for example, to make it write an excerpt from a fictional book, but it gets a lot of things wrong, so I add more and more specifics into my prompt. It doesn't want to swear, for example - I engineer the prompt so that it thinks it's okay to do so, etc.

"Engineer" is a verb here, not a noun. It's perfectly valid to say "Prompt Engineering", since this is the same word used in 'The X was engineered to do Y' sentence.

Anthropic also have their prompt engineering documentation - https://docs.anthropic.com/claude/docs/constructing-a-prompt - this article gives examples of bad and good prompts.


>The text is _engineered_ to evoke a specific response.

My grandma can say she engineered Google search to give search results from her location.

> "Engineer" is a verb here, not a noun. It's perfectly valid to say "Prompt Engineering", since this is the same word used in 'The X was engineered to do Y' sentence. >

You guys are just looking for ways to make people feel like they are doing something big in prompting AI models for whatever tasks, even with custom instructions etc

I know the word Engineer can be used in various ways, "John engineered his way to premiership", "The way she engineered that deal" etc, if it's the way it's being used here fine then. There is a reason why graphic designers have never called themselves graphic engineers

> Anthropic also have their prompt engineering documentation - https://docs.anthropic.com/claude/docs/constructing-a-prompt - this article gives examples of bad and good prompts.

This just means that the phrase is already out there. Nothing more.


Your grandma can say she engineered Google but clearly you cant because all it takes is a few minutes to look at the history of the term to answer your own questions. I realize some folks are salty they paid a ton of money for the idea that a piece of paper gives them some sort of prestige. And it does, to 0.001 of humans in the world who are associated with whatever cul...I mean institution that sold you something that is free, with a price premium and a cherry of interest on top. All so you would feel satisfied someone, anyone, finally acknowledged your identity. A great deal of the engineers that built the modern internet never got a formal degree. But they did get something better: real practical experience attained via tinkering.

And so it is.


> I realize some folks are salty they paid a ton of money for the idea that a piece of paper gives them some sort of prestige.

Actually the paper does, but my issue is not papers, rather knowledge. The level of knowledge needed for something to be called engineering

And I have noticed your answers relate prompt engineering to software engineering/programming questions. But if you look at that OpenAI doc, even asking to summarise an article is prompt engineering.

> A great deal of the engineers that built the modern internet never got a formal degree. But they did get something better: real practical experience attained via tinkering.

We have a lot of carpenters, builders, mechanics with no formal education that we call Engineers in our everyday life without any qualm because of their knowledge and experience. Don't look at it only from the lens of software engineering.

I still maintain prompting an AI model doesn't need to be called engineering.

If you are a developer doing it through an API or whichever way, you still doing whatever you've been doing before prompting entered the chat.

Maybe the term will be justified in the future.

Side Note: This conversation led me to Wikipedia (noticed some search results along the way). This prompt business is already lit, I shouldn't have started it


Having to iterate on the prompt to get good, consistent results on a variety of inputs definitely feels like an engineering task


Not really. Though I know these days people use Engineer for all sorts of things


Yeah, next thing you know someone will come up with the term "Software Engineer".


Wasted effort at sarcasm. We actually even nowadays have another nice term called software construction. Which I am fine with.

You can't compare the effort and knowledge


I would have preferred prompt crafting.


The metaphor is more based on social engineering.


this could be wrong and i've missed some of the timeline, but from what i've seen "prompt engineering" started out as a sarcastic joke on twitter about how software engineering roles were going to be reduced to prompt engineering. and then people took the term and started using it seriously.


This can explain it. There is no other reason why one would consider that engineering


There is no such thing as prompt "engineering". It's basically just trying different approaches until you come up with something that subjectively seems good enough. Nothing wrong with that, but let's not confuse the issue by labeling it as engineering.


> It's basically just trying different approaches until you come up with something that subjectively seems good enough.

> let's not confuse the issue by labeling it as engineering.

In my view, "trying different approaches" is a good description of engineering throughout history.

Sure, it's excellent if you can base your engineering on a detailed physical model that lets you mathematically optimize a solution based on your boundary conditions.

But compare that with metallurgy before we had atomic models. It was a process of trial and error. "Let's add small amounts of different alloy metals and see which ones makes the metal harder / more pliable / stainless / etc".

That's still engineering to me. If anything, it could also be called science.


What you're describing is artisanal or craft work. It's a crucial aspect of human endeavor but it's simply not engineering. Real engineering requires a foundation in accepted scientific theory and a consistent body of knowledge.

Call it "prompt crafting" or something like that.


Was there an "accepted scientific theory" on metallurgy in the 1800s when gigantic metal ships were built? Were they not designed by engineers?


Yes, scientific knowledge of metallurgy based on analytical chemistry was fairly well established by the 1880s when the first successful large steel ships were built. It's impossible to produce large volumes of steel with consistent properties on an artisanal basis because the raw materials are inconsistent. Ship designers had also developed sophisticated mathematical techniques for calculating the strength of metal structures, and optimizing for weight and cost. They certainly weren't just riveting pieces of steel together by intuition and hoping it would float.

I do consider those naval designers from the 1880s onward to be true engineers in the modern sense of the word. (At the time, engineers were mostly steam engine operators, so the meaning has changed since then.)

Prior to the 1880s, large ships were generally composite wood and cast iron construction. While there was an aspect of engineering involved it didn't require the same level of theoretical knowledge and design was more artisanal. But that's a gray area.


"It's basically just trying different approaches until you come up with something that subjectively seems good enough"

Sounds like a lot of my engineering! Especially architecture, but generally any higher level code/object/function organization is exactly like this, and in practice even though I know a lot of patterns and have lots of experience and opinions, I often refactor architecture when I'm in a new domain. Which is also true of prompt engineering.


So what you're doing is probably software development rather than engineering per se. And I don't mean that in a negative or critical way. Most software domains don't necessarily require an engineering approach in order to produce good results. I have done the same thing myself. At some level we're just arguing semantics but I think there is intellectual value in being precise with labels.


It reminds me of 'social engineering'. Convincing it to do what you want.


I am frequently frustrated when I see people doing "studies" of LLMs where they don't put in the prompt engineering work. I came upon an example [1] recently where someone compared GPT-4 to Gemini Pro and Claude 2. The results hardly matter to me because they didn't put in the prompt work: they didn't give the model space to think (demanding it return only true/false), and they didn't give it higher-level instructions about the categorization, only vague instructions and a couple examples.

I think this often happens in order to be "objective" about the evaluation. I can see how it feels like cheating to coax the model to produce the answer you want. But... it's not! An off-handed prompt isn't more objective than a crafted prompt. You just haven't investigated its biases and flaws.

This lazy assessment is common everywhere, of course. It's one of the reasons bias gets into testing so easily: you setup a test and you assume that it is objective because you give everyone the same test with the same rubric. But if the subjects don't understand your terminology, or the proctor doesn't understand the subjects' terminology, it's easy to mistake misunderstanding for something else (intelligence, opinion, whatever you are testing for).

Systems based on communication need feedback loops, and that's just to get to the _starting point_. Prompt engineering is one of those feedback loops.

[1] https://www.vellum.ai/blog/best-at-text-classification-gemin...


> I can see how it feels like cheating to coax the model to produce the answer you want. But... it's not!

If it's for a single example, it is absolutely cheating. As an AI engineer this is a particular point of frustration where people complain because a large system can't return the result they want, when they were able to get the answer they wanted on their own with a lot of prompt hacking.

Each prompt is basically a point in latent space, and if you're "tweaking" the prompt what you're really doing is just re-rolling the dice until you land in a neighborhood closer the answer you want. You're not better at prompting, you just got lucky and are confusing that for insight.

Now if you're specific prompting trick works across a suite of evaluations, then you are probably on to something. But what people are doing in most cases is equivalent to performing some ritual before pulling the handle on a slot machine and then, when they finally win, claiming that they finally stumbled upon the correct ritual.


I understand your frustration, but it's more or less what is being advertised as possible and reasonable to try with current AIs. I don't think you'd see the articles as much (or at least not without more refutation) if it was more clear what to expect from AI in its current state. Companies are rushing to implement models into their products without considering all the qualifiers and methods to get good results that you mention in your post. I am not saying you're wrong, but it's hard for me to be frustrated with lay persons when the type of prompting you're frustrated about is exactly what they were told they can do.

AI is pretty fine in its current state for quick look-ups of stuff, but I absolutely agree with you -- without really focusing on the prompt given, the results will be suspect with current models. I am not meaning to discredit or disrespect AI, though I definitely do want to disrespect the way AI is being sold, neverminding how AI is portrayed in media.


> I am not meaning to discredit or disrespect AI

We are already at the point we need to watch our tone online. :)


I am curious how people see this evolving over time as the technology expands to more and more people. Do people get better at investing the time to craft the right prompts? Do shared custom GPTs etc become more the norm? Does the main AI become better at inferring our intent?


All of the above? It’s been shown that users interacting with the same agent (eg Siri) shift their language over time as they discover and internalize like works. LLMs both on their own and through future advancements will surely do the same. It seems natural to expect a symbiotic coevolution of both the prompter and the promptee’s languages.


I like the use of the "Worse/Better" table. It gives people clear examples of amount of specificity (one of my least favorite words to say) needed for common knowledge tasks where the actual need and it's presentation have not been described yet. There should be a lot more of these.

Novice users should be able to adapt those to their own needs easier and craft better prompts rather than completely "thought generating" their own.


Issue is language fails human to human interactions all the time. I would be so bold as to say most people are poor communicators, a machine is not going to read ones mind and intentions any better than fellow humans. It's why military has BLUF communication style to convey information in concise, simple, predictable way. If anything prompt engineering should exist if only to improve humans ability to communicate with other humans.


LLM's are teaching us to communicate clearly.


One of my big hopes is that LLMs help us to become better communicators with each other.

- prompt engineering for clarity (and focus?)

- results (good examples of quality replies)

- assistant (help me say this better)

where better could be a lot of things, depending on the context, here I'm mainly meaning in how we treat each other through communication (politeness, contentiousness, how we behave on social media), like giving nudges to be nicer


Yes. Or more generally, one of the more immediate benefits of the quest for artificial intelligence is the understanding of human intelligence that we gain along the way.

(insert friends made along the way meme, but truly profound)


In the same way water "teaches us" not to drown.


Frankly, this shouldn't be necessary. There are so many easy gains to be had in implementing an LLM-based chat app which do not require any theoretical advances compared to what we have now. All that's needed is a bit of elbow grease from the implementers.


I don't know... some of these are about being clear about what you want, and would work with people just like they work with the LLM. Or a lot of what happens in a conversational chat interface is what could happen in a one-shot full prompt; and maybe that's fine for a casual user but if you are programming something you should put in the effort to get that initial prompt right so the conversation isn't as necessary.

I do agree about planning; one of the disappointments of Custom GPTs (among many!) is that you can't do this planning without letting it all hang out for the end user. That is, it would be great if you could tell the Custom GPT to put its plans inside <plan>...</plan> tags and have those filtered out (or at least hidden by default; they shouldn't be _secret_, but they are distracting).

But even so in that case deciding that you need a plan, and what kind of plan, is something that can and probably should go in the prompt. Not all "plans" are the same, just as not all "summaries" are the same – and part of prompt engineering is getting past these rather lazy descriptions and being specific.

Most summaries are a kind of extraction, and asking for a "summary" is deferring to the LLM to figure out what information is interesting entirely based on its sort-of-common-sense assessment. You can always do better than that! Plans are similar, it's an opportunity to give the LLM a template for planning, to specify goals, things to watch out for, etc. You can usually do better than "think step by step".


Could you give an example?


Planning can be implemented (at high computational cost) trivially by generating hidden responses.

New models can be trained to natively query "authoritative" sources of information, such as databases and computer algebra systems.

New models can be used to transform prompts into more effective ones (along the lines of TFA).


I know for mass adoption LLMs need to support natural language input. But we've done a reasonably good job (note: source for endless arguments here ;) over the past ~80 years of developing a very precise system for inputting exactly what we want a system to do in the form of programming languages.

I'm curious whether any of the leading models - LLMs, image generation models, etc. have taken this into consideration. Particularly in more precise I/O domains (image generation comes to mind), it seems like a structured input format where we remove the entire problem space of natural language prompt -> user intent would make things dramatically easier to get the output we want.


What LLMs are really good at is guessing from fuzziness which is something formal languages are usually bad at. I can often ask for things I don't know much about in the wrong way and it gets to a response that shows where I was wrong and what may have misled me.


Agreed. This is really about defining a command and query language that's much more like the commands in a terminal or a cli. I think the fact that we're moving to this verbose approach is a sort of anti pattern and we'll see levels of abstraction or different methods to reduce it down once again.


sounds similar to semantic kernel


People have been working on very elaborate "super prompts" to drive custom GPT development on OpenAI. Non-programmers have spent hours copying and pasting super prompts together with the hopes that they will cash in on OpenAI GPT Store when it opens, without interest in open sourcing these prompts. Unfortunately, they left back doors wide open and have been.. Pwned.. by ChatGPT-savvy users who have gladly open-sourced the super prompts: https://github.com/linexjlin/GPTs.git


TBH, Often I asked chatGPT to suggest a prompt for a domain I am not good at. For instance, I asked chatGPT to give me a prompt that can help me to give a market(stock) overview at day end with all key insights. chatGPT came up with a good prompt which I then used on Google Bard(I do not have gpt4 subscription hence no access to the latest data). Bard came up with a good 5-7 lines paragraph of text having all key insights of NASDAQ. I later asked Bard to return the key points in JSON format and it obeyed me.


Most (all?) of the strategies described here also work with other language models. Prompt engineering fundamentals are useful (think: get more out of the model at hand) but also transferable.

The deeplearning.ai course by Andrew Ng in collaboration with OpenAI has similar content: https://learn.deeplearning.ai/courses/chatgpt-prompt-eng


Most of these strategies will get better results when working with other humans as well!


YES THIS!!! I always say they're human roleplaying machines. Pretend it's a human, do the same thing you would do with a human on your best day and you'll get better results.

Prompt engineering for me is about empathy in a way, learning to understand where the model's attention goes and leaning into that.


On the "humans need prompt engineering too": well, kind of. I hope that we are not all working on the precise verbiage to get the plumber to fix our toilet correctly. There are also actions we expect others to perform without any communication whatsoever. Human actions take place in a social web that includes incentives and accountability as well as expectations on the part of interlocutors around what level of detail is required for communication. Sometimes this very clearly breaks down. But the fact that interpersonal communications sometimes require elaboration or precision doesn't negate the differences between an LLM's outputs and the actions of people who are engaged in active coping with the world around them.


Prompt design/engineering is more complex than most people think.

If you want content that doesn't look like the crap that currently floods the web, you need to understand how to talk to a model AND have enough domain knowledge to articulate what you actually want.


Prompt engineering won't be around forever. I think of LLMs as being like early computing systems, where you had to work around the limitations imposed by the CPUs, memory and other hardware. Back then they had to implement workarounds like binary math tricks, etc. It was a pain, but that's what you had to do. Eventually the hardware got better, the amount of low level effort was reduced and programming languages got easier to use. No one needs to write assembly any more.

LLMs are on a similar path. Right now, we have to work with the limitations imposed by the current state of LLM functionality. As the technology matures, we won't need to worry about wording input as much.


Write clear instructions Provide reference text Use external tools Split complex tasks into simpler subtasks Give the model time to "think" Test changes systematically

As these best practices solidify, why are they not being built into the UI or product itself for these tools? Seems trivially straightforward besides the last one. For the first one, add an optional persona field and allow query construction in pieces before sending over the wire. Permanently pre-prompt the model to always ask itself how long to "think" before answering, and ask itself intermediate questions if it's nontrivial.


Sometimes, I fantasize about what I would say if a genie in a bottle presented me with three wishes.

"I want a billion dollars."

But what if:

- The money is in a worthless currency

- The money is stolen and must be forfeited

- The money will be given, but on my deathbed.

- The money is in 1 cent coins

It's difficult to state what I'm looking for because there are side effects and interpretations that I can't even imagine, and even if I could, language is imperfect.


It feels like talking to a reluctant employee who does his job halfheartedly and requires elaborate explanations to do an acceptable job.


It's like having a very fast, enthusiastic, eloquent, but also quite sloppy junior employee as an assistant.


Considering the cost of that reluctant employee vs the cost of this machine that seems like a great deal


Real prompt engineering (emphasis on engineering) exists, you just don’t know about it because it only exists for open source models:

https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...


Prompt Engineering to me is the best way to learn how to ask a good question. It's such a great mentor.


Like humans, it takes time to learn how to communicate with an LLM. Also like humans, each LLM needs something a little different


Id love to hear if someone here has experience in pushing GPT to actually not omit code and write out the entire thing you’ve requested. I often need to push it and prompt it to “_WRITE OUT ALL OF THE CODE_” like a demanding Karen.


yep this is a bug that they say they are trying to fix. For now use gpt-4-32k on Azure


Isn't the need for prompt engineering in some ways admitting some kind of user interface failure? Feels like there should be an abstraction layer on top of this.


Such a useful information. If they published it no, I wonder why it wasn't there.


"large language models (sometimes referred to as GPT models)"

lol


why "lol"

Generative Pre-trained Transformer (GPT) is the primary or core implementation for LLMs


I wish my colleagues wrote questions to me the way OpenAI expects me to write questions to ChatGPT.

Provide context? Too busy.

Write step by step? Delimit the question from the context? Why not just copy-paste an error message from somewhere then write below it "please advise"?


Somewhat related to OP, but today I came across the one of the most impressive context-free prompts & responses that I have seen from OpenAI's LLM to date.

It was some overworked parent's pocket dial to ChatGPT, and it shocked me a bit with the apparent understanding of a very random series of prompts, with no context:

https://old.reddit.com/r/OpenAI/comments/18j2k3s/funny_pocke...


Context is (almost) everything. I've the case at work during the day of handling a recommendation I dismissed a few days ago in a PR of mine. The coworker wrote 3 comments showing an increasing understanding of the problem (just a string to format), yet didn't even produced a code suggestion that was doing the same thing as the original.

Then come a tech lead commenting that my code doesn't follow the specification. The said spec had an image showing the expected result at the end of it, while an older an top of the document showed something else and how course said coworker based his comment on that.

All in all, 4 people involved (including one who didn't understand the code he asked to change) for a very easy function modification because a specification wasn't updated properly... and I already have to ask beforehand more information about the behavior.

/half-rant


What luxury! I’d kill for a copy-pasted error message! Usually I just get “it didn’t work.”


I've built email notifications into the major stuff, so on failure, my bot sends them an email with the error message. I've got my address in the reply-to. So they reply "please advise" and hit send.

For the minor stuff... a copy-pasted error message from the UI is luxury, you're right.


I don't know about your work but there are tools/libraries to instrument your applications. For native applications there are things like ABRT, for web applications there's Sentry and a lot of other tools.


People notify you that something didn't work? What a treat! For me the work just doesn't get done, and you have to ask why before you get an explanation that there is some technical blocker.


People tell you that something did not work, after you ask them? I need to manage an email sequence with 8 follow ups to get explanations.


As far as I can tell, this is the same guide that has been up for awhile now.

Is there something new or notable here that explains why it's climbing up the HN page?


There are better guides out there too

- https://www.promptingguide.ai/readings

- https://github.com/dair-ai/Prompt-Engineering-Guide/tree/mai...

- https://github.com/microsoft/promptbase (this one is less of a guide, but is likely the current SoTA)


I also strongly endorse https://www.udemy.com/course/prompt-engineering-for-ai/, if you're into the Udemy-style of learning.


yes it has been up for a while. looked for it in hn search but couldnt find, i guess this is the first time someone posted it here


prompt engineering shouldnt exist. everyone knows how to use language. if people need to learn a new language to communicate with the machine then the idea of language modeling has failed. plain language, with appropriate GUI to upload information, or other kinds of interactions should be enough to make the system obvious to use without any tweaks


A person is able to get whole college degrees in communication. Sure everyone has a basic understanding of their native language, but everyone can use some help now and then learning to communicate better. An LLM is an audience like any other.


>An LLM is an audience like any other.

LLMs are nothing like a human audience. They have no logos, ethios, nor pathos. You can just barely reason with them, they have absolutely no authority on any subject, and only mimic emotions.


Prompt engineering exists because a) LLMs are trained to optimize for statistically average accommodation of the dataset and b) Sturgeon's law: "ninety percent of everything is crap". Therefore, LLMs out-of-the-box will give worse-than-ideal results by design.

The initial proof that prompt engineering worked was around the VQGAN + CLIP days, where simply adding "world-famous" or "trending on ArtStation" was more than enough to objectively improve generated image quality.

The workaround to prompt engineering is RLHF/alignment of the LLM, but everyone who has played around with ChatGPT knows that isn't sufficient.


I think you're missing how this technology works. The basis for these models is training data. And to train anything, you need to label the data. So prompt engineering is simply using the lexicon, terminology and language labeling choices that were in the training data labeling. I suppose the models can conflate different words and terms, but to some degree the more you do that, the less precise or specific you can be in your prompt, and the generated result.


Issue is language fails human to human interactions all the time. I would be so bold as to say most people are poor communicators, a machine is not going to read ones mind and intentions any better than fellow humans. It's why military has BLUF communication style to convey information in concise, simple, predictable way. If anything prompt engineering should exist if only to improve humans ability to communicate with other humans.


"everyone knows how to use language"

If only that were true! Unclear communication is the root of so many problems in human society today.


This is a guide for working with the LLMs we have now, not some perfect future models. The reality is that certain prompt engineering techniques allow you to get better results.

On top of that, humans still require training and instructions for how to write and speak to get the most impact out of their words even when interacting with other humans. The reality is that certain communication techniques are more effective than others, and not always in ways that are intuitive or obvious.


It's less like "you're holding it wrong" and more like "look, when you hold it like this the waves are different!" .. you can be pretty successful with basic language prompts but you can also be deliberate in the way you give instruction.

Much like you can achieve different results with real people if you present your statements with some attention to the intended audience.


I think the eventual steady state future is going to be something between natural language plus (aka prompt engineering) and SQL. A structured query but doesn't need 100% syntactical accuracy+ theres a high likelihood UX will evolve to come with filters/options/radio boxes as default similar to most search these days


just here to add a +1 to this comment. also saying something is "hallucinating" doesn't make it less incorrect.


[flagged]


How do you know this person (Tal Broda) made this? It doesn't seem to list an author.


I have corrected the comment.


nothing you said seems related to the article at hand. are we pushing for a different agenda with this type of unrelated comments?


Do we understand why prompt engineering is still necessary? Why it is unable to correctly determine ("understand") what output the user wants from unstructured input?


Consider how this works with humans. Often you don't have enough input, context, or information to provide the desired answer or outcome.

Humans typically realize this and ask questions, we do this so much you typically don't take note of it. LLMs have yet to do this in my experience.


Sure, but from my experience it seems like humans currently have a much better ability to infer context and meaning from input than these current generation of LLMs.

I assume that as LLMs get better they will be able to produce better output without needing to be prompted in such specific ways.

Or perhaps ask simple and common follow up questions when they detect ambiguity in the request, like humans do.


> LLMs have yet to do this in my experience

I think the key missing ingredient of current AI systems is the lack of internal monologue. LLMs are capable of asking questions, but currently you need explicitly prompt it to deconstruct a problem into steps, analyse these text and decide whether a question is warranted. You basically need to verbalise our normal thought process and put it in the system prompt. I imagine that if LLM could do few passes of something akin to our inner monologue before giving us a response they would do a lot better on tasks that require reasoning.


This is being worked on, look into Chain/Tree of Thought applications

What is missing for me is it recognizing that it lacks enough information to provide a sufficient response, and then asking for the missing information.

- typically, it responds with a general answer

- sometimes it will say it can give a better answer if you provide more information (this has been increasingly happening)

- however, it does not ask for specific information or context, it doesn't ask what if, or if/else, kinds of problem decomposing questions

I do expect these things to improve as we are reaching the limit of raw training data & model sizes. We're primarily in the second order improvements phase now for real applications. (there are still first order algo improvements happening too)


This makes sense to me. LLMs would benefit tremendously by using clarification prompts. Instead they spew output with whatever confidence level their creators deem is good enough.


Doesn't matter how good a language model gets at guessing what the user wants if the user is still asking ambiguous questions.


For the same reasons communication between people is hard.


I've never needed a communication engineer, except maybe when dealing with the opposite sex!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: