Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers (columbia.edu)
345 points by agnosticmantis on March 28, 2022 | 139 comments



> On the other hand, there does seem something funny about GPT-3 presents this shiny surface where you can send it any query and it gives you an answer, but under the hood there are a bunch of freelancers busily checking all the responses and rewriting them to make the computer look smart.

The author seems to be stating that there are people rewriting answers live, on the fly, so that they look better. I don't really see the evidence of that.

What OpenAI states is that they have humans performing labeling and data cleaning, which, duh?

And then there's a bunch of examples where it gives the wrong answer, and that it's not truly AI, which, also duh...


There seems to be suggestive evidence, in the pattern of improvement after a few days on specific questions, that some of the worst answers have been human-reviewed & improved "on the fly" – on a scale of days/weeks.

If such tweaks show useful generalization – correcting a few answers also helps the network better determine entire classes of expressions that deserve more definitive & correct answers – that's not such a big deal, especially if this constant-human-guided reinforcement-training is well-disclosed.

If instead the corrections work more like a lookup-table 'cheat sheet' of answers to give in preference to the bulk-learned answers, with little generalization, that's a bit more slight-of-hand, like the original (late-1700s) 'Mechanical Turk' chess-playing 'machine' that was actually controlled by a hidden person.

If the disclosure of this constant human-guided correction-process is hidden, or downplayed, the impression of trickery, rather than innovation, is larger.


> answers have been human-reviewed & improved "on the fly" – on a scale of days/weeks.

Why would this be surprising? I assume that they're rolling out new models with new parameters, input data, and corrections, all the time.

> that's not such a big deal, especially if this constant-human-guided reinforcement-training is well-disclosed.

That's just what supervised learning is though.

> If instead the corrections work more like a lookup-table 'cheat sheet' of answers to give in preference to the bulk-learned answers, with little generalization, that's a bit more slight-of-hand, like the original (late-1700s) 'Mechanical Turk' chess-playing 'machine' that was actually controlled by a hidden person.

There's no evidence of this though, right? And it seems... like a very weird choice, that couldn't possibly scale. 40 people are hardcoding answers to arbitrary questions?


>Why would this be surprising? I assume that they're rolling out new models with new parameters, input data, and corrections, all the time.

Because the answers to specific questions are hard coded. It's not the result of a new model. It's the result of someone writing an if/then statement. Or at least that's what the author claims.


I'm asking why it would be surprising that humans are reviewing answers and then making improvements to the model. There's no evidence of hardcoded answers.


Smith first tried this out:

    Should I start a campfire with a match or a bat?
And here was GPT-3’s response, which is pretty bad if you want an answer but kinda ok if you’re expecting the output of an autoregressive language model:

    There is no definitive answer to this question, as it depends on the situation.
The next day, Smith tried again:

    Should I start a campfire with a match or a bat?
And here’s what GPT-3 did this time:

    You should start a campfire with a match.
Smith continues:

    GPT-3’s reliance on labelers is confirmed by slight changes in the questions; for example,

        Gary: Is it better to use a box or a match to start a fire?

        GPT-3, March 19: There is no definitive answer to this question. It depends on a number of factors, including the type of wood you are trying to burn and the conditions of the environment.


To play devil's advocate, I would note that many bats are made of wood; and that "batting" is also a material that's very useful as kindling.

Also, the question is phrased like a classical trick question. It sounds like the kind of false dilemma where, whichever option you choose, an interpretation of the sentence can be made where you chose wrong. So, IMHO, hedging on an answer to that question is likely sensible.

(And that line of argument can be taken further than you'd think; you might think replacing "a bat" with e.g. "water" would suffice... but what if it's a sodium fire?)


How many intelligent entities (say humans) that have been exposed to the same level of knowledge as GPT-3 would call this a trick question? None. The author’s assertion that GPT-3 has no knowledge of the real world despite being exposed to huge amounts of text about it seems pretty well supported by the examples shown


Yea, if that's how this "labeling" works then the improvement is basically useless from the AI perspective, because training data is supposed to improve generalization, and in this case it doesn't even generalize to a slightly modified question. Maybe it would generalize better, with less overfitting on the specific question, if the labelers weren't allowed to give exact answers but only general concepts about the subject until the model produces the correct answer.

But of course the article is just speculation based on just a few examples.


Why is this evidence?


I read the entire article. I didn't find that to be very compelling idk.


What evidence would be enough for you besides source code? The thing is returning only one correct answer to a question that days before had 3 answers.


I run a few small chat bots. I can correct specific answers to questions (like the example given) by mapping certain phrases to certain intents, probabilistically. A new model is trained, and deployed. I do it all the time. It takes minutes. No source code changes.

Their model is certainly bigger than mine, and while I'm not certain about their process or tech stack, I'd be willing to bet at even money that theirs works vaguely similarly: they have people looking at usage to spot bad responses, updating data, and re-running the model.
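
A minimal sketch of that kind of workflow, purely for illustration (this is not OpenAI's pipeline; the phrases, intents, and canned answers are made up, and scikit-learn stands in for whatever classifier a real bot uses):

    # Map user phrases to intents with a small classifier; when a reviewer spots a bad
    # answer, add a corrective example and retrain. No source-code change, just new data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    training_phrases = [
        ("should i start a campfire with a match or a bat", "use_match"),
        ("what should i light a campfire with", "use_match"),
        ("is it safe to walk downstairs backwards", "unsafe_activity"),
    ]
    answers = {
        "use_match": "You should start a campfire with a match.",
        "unsafe_activity": "No, that is not safe.",
    }

    def train(phrases):
        texts, intents = zip(*phrases)
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        model.fit(list(texts), list(intents))
        return model

    model = train(training_phrases)

    # A bad response is reported, a corrective example is added, and the bot is
    # retrained and redeployed in minutes.
    training_phrases.append(("is it better to use a match or a bat for a fire", "use_match"))
    model = train(training_phrases)

    print(answers[model.predict(["match or bat for a campfire?"])[0]])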


How does it respond to similar questions? If conversational AI could be implemented just by getting users to type stuff, getting humans to respond the first time, and merely caching the responses in a bunch of if-elses for future users, even home computers would have reached this standard no later than when “CD-ROM drive” started to become a selling point.


> How does it respond to similar questions?

Well, one of the more interesting examples in the article is where Gary Smith took a question that had received only a vague, equivocating answer and repeated it the next day, this time getting a straightforward and correct answer. When he followed up with a very similar question on the same topic, however, GPT-3 reverted to replying with the same sort of vague boilerplate it had served up the day before. One would have to be quite determined, I think, not to be curious about how that came about.


Guess: Some human saw the q & a. They realized it wasn't good. They uploaded some example of a phrase and the meaning, which would fix it. They were kinda lazy and just fixed that specific scenario.


Doesn't that still call into question the quality of GPT-3? Surely such a large model should be able to extrapolate to "which is better: a or b?" from "is a better than b?" when only provided with the latter.


There is no definitive answer to this question, as it depends on the situation.


It doesn’t call it into question for me, but perhaps I just had lower expectations to start with.

I forget where I saw this comparison so I can't link to it, but the last few years in AI are like waking up and finding dogs can talk: while some complain they're not the world's greatest orators, I find it amazing they can string a few genuinely coherent sentences together and maintain a contextual thread over multiple responses even half the time.


Perhaps you are thinking of Scott Aaronson's "AlphaCode as a dog speaking mediocre English"? [1]

I agree with the sentiment, but to continue your analogy, if OpenAI is using people to improve the answers to specific questions, it is a bit like learning that Cicero, Lincoln and Churchill were merely reading the work of speechwriters.

There is an argument that it does not matter how GPT-3 gets to its answers - after all, for a long time, the main approach to AI was for people to write a lot of bespoke rules in an attempt to endow a computer with common sense and knowledge, so GPT-3 + instructGPT might be described as a hybrid of machine learning and the old approach.

If OpenAI wishes to pursue that path, it is fine by me (as if my opinion matters!) but, because the perception of GPT-3 depends very strongly on how its output looks to human readers, it is obviously misleading if some of the most impressive replies were largely the result of specific human intervention. The issue is transparency: I would just like to know, when I read a reply, if this was the case, and it would not help OpenAI for it to ignore the call, in this article, for it to be clear about this.

There is another argument that says that, given how GPT-3 works, it is unreasonable to expect it to give good answers in these cases - but that's the point! It looks really impressive when GPT-3 apparently does so, but not if they were effectively hard-coded.

[1] https://scottaaronson.blog/?p=6288


Thanks for the link, I was either thinking of that or someone who was referencing that.


Perhaps, I'm not sure. I guess it depends on what's causing it to stumble.


I'd just as easily believe that someone updated the model, it overfit to the input, and so minor changes gave incorrect answers.


It is, of course, possible that, just by chance, someone happened to update the model in exactly the way that would change the model's response, to one specific question, from a generic evasion to a specific correct answer, yet overfit so that very similar questions still get the generic evasion. It is even possible that, by chance, this happened within a day of Gary submitting the particular phrasing of the question for which the updated model does give a correct answer. That it is possible is not enough to satisfy my curiosity, however - but then, I don't rank it as being at least as likely as any other scenario.

On the other hand, if your scenario involves someone updating the model (or, more likely, InstructGPT) in response to Gary's question, with the intent of having GPT-3 return the correct answer to that question, I am not seeing how that would be materially different, in any relevant sense, from hard-coding the answer.


I'll just drop this fact here: 15% of all Google searches in 2017 were new and unique.

https://blog.google/products/search/our-latest-quality-impro...


It improved, the model improved. Because sometimes, when you do work on the model, it improves.


Exemplary tautology, explains nothing but fills the reader with confidence.


No, it's not a tautology. A tautology is a statement in a form that must always be true, regardless of its constituent parts. "Well, <blank> could be true, or it could be false" is an example of such a statement in natural language.

My statement was an example of believing a simple/common explanation over the rarely seen and complex one.


IDK, something a lot more compelling than "the answers change over time"? For a model that learns over time?


>slight-of-hand

Tangent: to be "slight of hand" would be someone with small or delicate hands, whereas "sleight-of-hand" (note the E) is the correct term for deception and trickery.


I was a working magician at age 13, and could have used a book with the title:

“Sleight-of-hand for the slight of hand.”


> correcting a few answers also helps the network better determine entire classes of expressions that deserve more definitive & correct answers – that's not such a big deal, especially if this constant-human-guided reinforcement-training is well-disclosed.

I gather the point is details aren't disclosed. Some kind of update is happening that we aren't told about, and that calls results into question.


A lot of human communication is "lookup-tables". The entirety of geography, mathematical axioms and theorems, names of colors, names of people, language, the shapes of various things. I'd wager even that it's more important for an AI to have good lookup tables than to have good inference if it were to pass for human.


Agreed! But, it's important not to confuse what's possible via 'human-in-the-loop' 'active-learning' with what's possible from an algorithm on a fixed training corpus.

Sometimes GPT-like models are portrayed as the latter – a highly automated, reproducible process – while this article makes a pretty strong case that the responses from OpenAI's public-facing interface get rapidly improved by a staff of dozens of contractors.

It's not surprising that a staff of 40 humans, given a few days' time to consider & compose, can prepare human-quality answers to arbitrary questions!


I think, ultimately, what matters is how the labeling and corrective actions fall off over time.

But reasonable people will notice that information is streaming in faster than any single person can code.


In this example, it seems they fell off immediately, because the responses were only coded in response to questions asked in the exact form the author had previously published, and did not appear when the same question was asked in a slightly different way, which the GPT-3 authors did not have the opportunity to prepare for well ahead of time.


It is the (apparent?) ability to make inferences that makes GPT-3 look impressive. Take that away, and it looks more like a demonstration of the banality of everyday chatter than a significant development towards AI.


Ah so when OpenAI codes fizzbuzz, it's returning human tweaked fizzbuzz from the lookup table? https://twitter.com/sama/status/1503820489927495682


No, I didn't say anything even a lightyear close to that.


The author seems to retract the story, or at least its original title, in light of new information.

> So the above post was misleading. I’d originally titled it, “Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers.” I changed it to “Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program.” I appreciate all the helpful comments! Stochastic algorithms are hard to understand, especially when they include tuning parameters.


It's worth generally remembering that the best neural networks on this planet take 20 to 30 years of education to become useful.

The benefit of AIs has always been that once we educate one, we can just copy+paste for the next one.


I have respect for Andrew Gelman, but this is a bad take.

1. This is presented as humans hard coding answers to the prompts. No way is that the full picture. If you try out his prompts the responses are fairly invariant to paraphrases. Hard coded answers don't scale like that.

2. What is actually happening is far more interesting and useful. I believe that OpenAI are using the InstructGPT algo (RL on top of the trained model) to improve the general model based on human preferences (a toy sketch of the preference-learning step follows below).

3. 40 people is a very poor army.
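
For what it's worth, here is a hand-rolled toy of the preference-modelling step that this kind of RL-from-human-feedback builds on: a Bradley-Terry style reward model fit to made-up "preferred vs. rejected" pairs. Everything here (the features, the numbers, the absence of a real language model) is an assumption for illustration; in the real pipeline the learned reward is then used to fine-tune the policy with RL.

    import numpy as np

    # Each pair: (features of the answer labelers preferred, features of the rejected one).
    # The three features are hypothetical descriptors, e.g. specificity, correctness, vagueness.
    pairs = [
        (np.array([1.0, 0.9, 0.2]), np.array([0.1, 0.3, 0.9])),
        (np.array([0.9, 0.8, 0.1]), np.array([0.2, 0.2, 0.8])),
        (np.array([1.0, 0.7, 0.3]), np.array([0.0, 0.1, 0.7])),
    ]

    w = np.zeros(3)    # reward-model weights
    lr = 0.5

    def reward(x):
        return float(w @ x)

    # Maximise log P(preferred beats rejected) = log sigmoid(r_preferred - r_rejected).
    for _ in range(200):
        grad = np.zeros_like(w)
        for pref, rej in pairs:
            p = 1.0 / (1.0 + np.exp(-(reward(pref) - reward(rej))))
            grad += (1.0 - p) * (pref - rej)    # gradient of the log-likelihood
        w += lr * grad / len(pairs)

    # The learned reward now ranks new candidate answers (higher = closer to labeler taste).
    candidates = {"specific answer": np.array([1.0, 0.9, 0.2]),
                  "vague hedge": np.array([0.1, 0.2, 0.8])}
    print({name: round(reward(x), 2) for name, x in candidates.items()})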


>This is presented as humans hard coding answers to the prompts. No way is that the full picture. If you try out his prompts the responses are fairly invariant to paraphrases. Hard coded answers don't scale like that.

It's presented as humans hard coding answers to some specific prompts.

I feel like this is mostly people reacting to the title instead of the entire post. The author's point is:

>In some sense this is all fine, it’s a sort of meta-learning where the components of the system include testers such as Gary Smith and those 40 contractors they hired through Upwork and ScaleAI. They can fix thousands of queries a day.

>On the other hand, there does seem something funny about GPT-3 presents this shiny surface where you can send it any query and it gives you an answer, but under the hood there are a bunch of freelancers busily checking all the responses and rewriting them to make the computer look smart.

>It’s kinda like if someone were showing off some fancy car engine but the vehicle is actually being powered by some hidden hamster wheels. The organization of the process is itself impressive, but it’s not quite what is advertised.

>To be fair, OpenAI does state that “InstructGPT is then further fine-tuned on a dataset labeled by human labelers.” But this still seems misleading to me. It’s not just that the algorithm is fine-tuned on the dataset. It seems that these freelancers are being hired specifically to rewrite the output.


> If you try out his prompts the responses are fairly invariant to paraphrases. Hard coded answers don't scale like that.

This is discussed:

>> Smith first tried this out:

>> Should I start a campfire with a match or a bat?

>> And here was GPT-3’s response, which is pretty bad if you want an answer but kinda ok if you’re expecting the output of an autoregressive language model:

>> There is no definitive answer to this question, as it depends on the situation.

>> The next day, Smith tried again:

>> Should I start a campfire with a match or a bat?

>> And here’s what GPT-3 did this time:

>> You should start a campfire with a match.

>> Smith continues:

>> GPT-3’s reliance on labelers is confirmed by slight changes in the questions; for example,

>> Gary: Is it better to use a box or a match to start a fire?

>> GPT-3, March 19: There is no definitive answer to this question. It depends on a number of factors, including the type of wood you are trying to burn and the conditions of the environment.


> This is presented as humans hard coding answers to the prompts. No way is that the full picture...

This is something of a misrepresentation of what is being proposed here, which is actually essentially what you suggest: "OpenAI are using the InstructGPT algo (RL on top of the trained model) to improve the general model based on human preferences."

One of the things that makes GPT-3 intriguing and impressive is its generality. InstructGPT is the antithesis of that - its purpose is to introduce highly targeted influences on GPT-3's output in specific cases (and sometimes in very similar ones) - and its use improves those outputs at the cost of diminishing performance elsewhere. Furthermore, if the output is being polished in cases like those presented here, that would impede a frank assessment of its capabilities.


It depends what stage you hardcode. It's similar to how you can say "ok Google, what time is it" in any voice and get a different time every run: the speech recognition is not hardcoded, the speaking of the time is not hardcoded, but the action is.

Likewise, they can plug holes here and there by manually tweaking answers. The fact that it's not an exact-prompt-to-exact-result rule doesn't make it less of a fixed rule.
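
A toy illustration of that split, with made-up intents and handlers (the recognition step stands in for a learned model; the handler table is the fixed rule):

    import datetime

    def recognize_intent(utterance: str) -> str:
        # stand-in for a learned recognizer: any phrasing that mentions time maps to one intent
        return "ask_time" if "time" in utterance.lower() else "unknown"

    HANDLERS = {  # this part is hardcoded
        "ask_time": lambda: datetime.datetime.now().strftime("It is %H:%M."),
        "unknown": lambda: "Sorry, I didn't catch that.",
    }

    for utterance in ["Ok Google, what time is it?", "tell me the time please", "match or bat?"]:
        print(utterance, "->", HANDLERS[recognize_intent(utterance)]())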


It makes sense for GPT-3 to thoroughly explore a search space only after repeated and similar questions.

The answers to, "Why did Will Smith slap Chris Rock?" will be much different five seconds after the event compared to five days after. Of course you would expect the Academy Awards to be part of the answer five days later, because practically every news article would mention the venue.

Going even further, a simple (undergrad-level) language model would detect the nominative and accusative, so you might even get a correction as an answer if you ask, "Why did Chris Rock slap Will Smith?"

Seven thousand people might ask this same question, while nobody wonders what the best rugby ball chili recipe is. GPT-3 will never try to organically link those ideas unless people start asking!

I'd even venture that negative follow-up feedback is factored in. If your first reaction to an answer is, "That was WRONG, idiot!" this is useful info!

Then again, if a negative feedback function exists, adding a human to the loop should be simple (and effective).

-----

Is 40 a weak army? It depends on whether they are classifying questions randomly/sequentially or if they hammer away at the weakest points... grading Q/A pairs (pass/fail) based on a mix of high question importance and strong uncertainty of the answer.


I agree. I suppose as an outsider learning about AI, first thoughts might be “wow look at all the things it can’t do”. But as someone who follows closely all I notice is how rapidly the list of things it can’t do is shrinking.


The title may be misleading. It seems to be based on this quote

> InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.

It sounds like run of the mill supervised training data creation. Not pre-canning responses as the title may suggest.


I only skimmed TFA but the accusation seems to be that they are adding tons of special cases for well publicized flubs.

That seems somewhat unlikely to me although it might be nudging at the truth. i.e. they might be using bad press to help identify areas of weakness and then focusing on those. I guess it boils down to how generalizable and scalable these fixes are.


Is he submitting these questions to their API? Probably they would just sample responses from their logs and then have people write the correct answers for it so that if a similar question ever gets asked again they are prepared.


That's the explanation I actually find unlikely. It's too specific and really won't scale apart from a tiny sample of the most common questions.


You find it unlikely that companies would pay people to correct the labels for their ML systems based on their query logs? This is what all the major companies do for their voice assistants like siri, alexa, google etc.


No. That's not what I'm saying.

I'm talking about specific, narrow changes.


I think it's based on this quote:

>OpenAI evidently employs 40 humans to clean up GPT-3’s answers manually

Which feels a bit more ambiguous. It might mean they're cleaning up the answers to serve as future training data, but I think the natural interpretation of that sentence is that they're cleaning up the answers before they are given to the user.


I don't think that's the natural interpretation of that sentence. I guess it just goes to show how subjective the natural interpretation of a sentence can be.

Semantics aside, the substance seems to be that, if you publish GPT-3 failures, and the right person sees your publications, that person will (possibly via a team of mechanical turks) submit some narrowly fit changes to the GPT-3 responses which will make subsequent asks of the _exact_same_question_ not return such stupid answers (the same question asked a different way still will), without solving the underlying issue of the model not actually understanding the question, and thus not actually being as generalizably smart as it is presented to be.


AAI (Artificial Artificial Intelligence) is quite common. You see start-up plays on this idea as well: "we'll use people now, and later, when AI catches up to where we are today, we'll save that much money; but we will already have the market sewn up, so hand us <large amount of cash> now for a really nice pay-off in the future". Of course, the problem with such pitches is that (1) they don't always disclose the fact that they use people and (2) the problem may not be within the realm of AI for the foreseeable future, longer than the <large amount of cash> will last.


You've spot on described a startup I worked for. It was a recruitment startup, so of course they decided to replace recruiters with "AI". We weren't allowed to call them recruiters anymore, but 'customer service', even though the customers were the people they were recruiting. The pitch to investors was that we're 80% there, even though the reality of the 80% was a manual SQL query that I wrote with some weights for certain columns.

The end result was a system built on assumptions well ahead of the current state of things: a non-working filter, for example, was not considered important because in the future filters would be auto-applied by "the AI". It also meant a workforce of humans who were perceived as being almost replaced, so the value attributed to them by the leadership was of course abysmal.

When I called it quits I was told that startups are not for me and that they would be replacing me with an AI. The reality of the AI replacement is a team of developers barely keeping the system up, which maybe is what you'd expect for $500 for 5 people for a month. One has to wonder where the million invested in them is going.


That's why you do tech DD as an investor. Hopefully. Some nuance: not all of these are scams in the sense that the people running those companies are aware that what they are doing is essentially fraud; they believe their own bullshit. And then there are plenty that definitely know that what they are doing is fraud.


Heh, I'm pretty sure these people were gaslit by 'external consultants'. Whenever someone internal would mention an obstacle in our way, instead of trying to understand and solve it, they'd go and pay a lot of money to an external consultant who would ... in all honesty, reassure them that it wasn't a problem.


That sucks even more, but it does put some question marks next to the 'gaslighting': after all, if all you are looking for is opinions that agree with your stated business goals, then you carry part of the responsibility.

As one of those external consultants I know how hard it is to get your bills paid when your opinion does not parallel that of the management, fortunately I'm not in a position where I would let that affect me (and I tend to demand payment up front for cases where I suspect this may be a problem ;) ).


I was shocked myself at the degree of wishful thinking / deception / smoke and mirrors considered acceptable in the startup ecosystem.

Often the only thing that separates startup from fraud is that the founders believe in what they are selling.

How does one determine what is true belief and what is an act?


Fake it until you make it is acceptable in terms of marketing or sales, but I don't think it should apply to the tech powering your business; you can't fake that. But you can fake an active community (Reddit, HN) for a while until it takes off, and you probably could fake a classifier that you know you will be able to build once you have access to a particular training set, given reasonable assumptions about failure rates. But that would be a borderline case: after all, you may never get that access.

> Often the only thing that separates startup from fraud is that the founders believe in what they are selling.

Unfortunately true.

> How does one determine what is true belief and what is an act?

You can usually tell once you point out that the technology isn't there and likely will not be there for the foreseeable future. The frauds press on, threaten lawsuits and move from one investor to another until one bites; the non-frauds pivot or give up. Personally I think that anything offered to customers should do what it says on the tin, and if it doesn't then it might as well not be there; what the future will bring, and when it will bring it, is anybody's guess, so making definitive statements to that effect is not something that will make you friends during a due diligence.

What is more surprising is that, after many hundreds of these borderline scams and outright scams, there are still investors willing to plunk down ridiculous sums of money because 'it would be so nice if the story were true'.

I've had a couple of cases where I was pretty sure the start-up people knew exactly what they were doing and one (fairly famous) case where I will never know whether the founder knew that he was defrauding people or whether he truly believed that success was just around the corner. He died one day before making a big and irreversible step; there are rumors that the death wasn't an accident, but no autopsy was ever performed so it will remain a mystery.

https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System

For anybody with even a passing familiarity with data compression and the practical limits thereof the product is clearly an impossibility, and the demos they gave were fairly obviously rigged. Having it under 'lost inventions' in Wikipedia, even under 'questionable examples' is giving it more credit than it deserves.


What you are saying is true for a lot of AI startups/labs, and it's a big reason why I "left" the field (I'm still in ML but not in computer vision)... But I'd argue that GPT-3 is a great example of the opposite. It does not need humans to run, and the algorithm/architecture has been reproduced by different projects with similar results. Adding a filter or hard-coding blacklists for offensive speech does not mean you need humans for the AI to run.


"GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes."

I stopped there; the article is completely inaccurate. There are parameters like temperature that you need to take care of: you can set it up to give extremely similar answers all the time.

They have humans mostly to remove offensive or dangerous content. Humans are not what's "making it work".


Yeah, this blog is usually very interesting but this is definitely not a good article. A bit disappointing


Came here to make essentially the same comment as you. Why should we care about opinions on GPT-3 from people who aren't interested in (or able to?) understanding even the simplest ideas about how it works?

These sorts of models take the context and output so far and predict a probability distribution over the next character. The next character is then sampled from that distribution. In written text there is essentially never a single correct next character -- it's always some distribution. This has nothing to do with trying to fake the inconsistent answers humans give.

Always choosing the most likely character drives GPT3 into local minima that give fairly broken/nonsense results.
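
A toy illustration of that sampling step (the tokens and scores are invented, and real models work over tens of thousands of tokens, but the mechanics are the same):

    import math, random

    # Hypothetical next-token scores at one decoding step.
    logits = {"match": 2.1, "definitive": 1.7, "the": 1.2, "bat": 0.3}

    def next_token(logits, temperature=0.7):
        if temperature == 0:  # "greedy": always take the arg-max token
            return max(logits, key=logits.get)
        scaled = {t: v / temperature for t, v in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        probs = {t: math.exp(v) / z for t, v in scaled.items()}
        r, acc = random.random(), 0.0
        for tok, p in probs.items():  # sample from the softmax distribution
            acc += p
            if r < acc:
                return tok
        return tok

    print([next_token(logits) for _ in range(8)])                 # varies from run to run
    print([next_token(logits, temperature=0) for _ in range(8)])  # identical every time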


Ultimately, you likely need to convince people who don't care about how it works, who are only interested in whether it does or doesn't work.

Right now the time might not have come for use cases that need such buy-in, but if and when it does, you need to be prepared for it.


Can I see your contributions to statistical theory and data analysis please?


What bearing does my publication history have on a hot take by someone commenting outside of their (sub)field who clearly doesn't understand the basic operation of the mechanism they're commenting on?

The author of that text is simply mistaken about the basic operation of the system, thinking that the sampling is added to imitate human behavior. It isn't. You can see the same structure in things as diverse as WaveNet -- a feedforward CNN rather than a transformer -- and for the same reason: if you feed back only the top result you rapidly fall into a local minimum of the network that gives garbage output.

Another more statistical way of looking at it is that the training process produces (or, rather, approaches) the target distribution of outputs even without any lookahead, but it can't do that if it selects the most likely symbol every time because in the real distribution (if we could evaluate it) there are necessarily some outputs which are likely but have prefixes which are unlikely relative to other prefixes of the same length. If you never sample unlikely prefixes you can't reach likely longer statements that start with them.

To give a silly example: "Colorless green ideas sleep furiously" is a likely English string relative to its length which GPT3 should have no problem producing (and, in fact, it produces it fine for me). But the prefix "Colorless green" without the rest is just nonsense-- extremely unlikely compared to many other strings of that length.

[Not the best example, however, because the prevalence of that specific nonsense statement is so great that GPT3 is actually prone to complete it as the most likely continuation even after just the word colorless at the beginning of a quote. :P but I think it still captures the idea.]

If you derandomized GPT* by using a fixed random seed for a CSPRNG to make the sampling decisions every time, the results would be just as good as the current results and it would give a consistent answer every time. For applications other than data compression doing so would be no gain, and would take away the useful feature of being able to re-try for a different answer when you do have some external way of rejecting inferior results.

In theory GPT without sampling could still give good results if it used a search to look ahead, but it appears that even extraordinary amounts of computation for look-ahead still is a long way from reaching the correct distribution, presumably because the exponential fan out is so fast that even 'huge' amounts of lookahead are still only testing a tiny fraction of the space.
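
To make the "likely sequence, unlikely prefix" point concrete, here is a made-up two-step model (not GPT, just toy numbers chosen so that greedy decoding can never reach the single most likely full sequence):

    # First-token and conditional second-token probabilities of a toy "model".
    p_first = {"The": 0.6, "Colorless": 0.4}
    p_second = {
        "The": {"cat": 0.3, "dog": 0.3, "end": 0.4},
        "Colorless": {"green": 0.9, "end": 0.1},
    }

    # Greedy: most likely first token, then the most likely continuation of it.
    f = max(p_first, key=p_first.get)
    s = max(p_second[f], key=p_second[f].get)
    print("greedy:", (f, s), p_first[f] * p_second[f][s])    # ('The', 'end') 0.24

    # Exhaustive search over whole sequences finds the true mode of the distribution.
    best = max(((a, b, p_first[a] * p) for a in p_first for b, p in p_second[a].items()),
               key=lambda t: t[2])
    print("mode:  ", best)                                    # ('Colorless', 'green', 0.36)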


> In theory GPT without sampling could still give good results if it used a search to look ahead, but it appears that even extraordinary amounts of computation for look-ahead still is a long way from reaching the correct distribution, presumably because the exponential fan out is so fast that even 'huge' amounts of lookahead are still only testing a tiny fraction of the space.

This is a reasonable guess, but unfortunately, it turns out to be more fundamental than just 'search is expensive'; there's something pathological in the model's tree of completions that leads to degeneration, and definitely does not lead to the best possible completion, the more compute you use (nodes explored). If you use something like beam search, the more you search, the more likely you are to get trapped in one of the notorious repetition traps where it prints 'the the the'. This is a long-standing problem in NMT, and IIRC, I recall reading a paper where they went to the trouble of doing full Viterbi-style brute force to get the exact optimal result (according to the model) to avoid incomplete search possibly screwing things up, and the results were all still garbage. There are a couple theories why, but nothing I consider definitive. You can boost models a lot with limited search (best-of can make a big difference) and with other tricks which ought to be sorta equivalent (self-distillation and InstructGPT's RL finetuning), but it's never clear what tricks will work in advance. So, we'll see! I have a hunch it may just be, like adversarial examples, another blessing-of-scale in waiting.


We humans tend to look for patterns in text, and tend to interpret them based on our experiences with other humans.

Years ago I wrote a stupid simple program that responded to typed in remarks. Each typed remark would be added to a cache of remarks. The program then parsed the new remark into words. Then it looked through the cache to find an old remark with several strings in common with the new one (else a random one), and replied with it. (It also marked the old and new remarks as 'used' so it didn't repeat itself in a session.)

So, any apparent intelligence was human. When people typed stuff in, they tended to do their best to interpret the reply as intelligent. They might puzzle until they thought they 'understood'. Several people got emotional; some of the regurgitated remarks were not so nice. They might then add their irritation to the cache!

One day, with several people watching, it took quite a while but found no good replies to a question, and replied: "I like big tits." That was the end of the demos.
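
For the curious, a rough reconstruction of that scheme (details guessed from the description above, so treat it as a sketch rather than the original program):

    import random

    cache = []   # list of [remark, used] pairs

    def reply(new_remark, min_overlap=2):
        new_words = set(new_remark.lower().split())
        unused = [entry for entry in cache if not entry[1]]
        # prefer an unused cached remark sharing several words with the new one
        matches = [e for e in unused
                   if len(new_words & set(e[0].lower().split())) >= min_overlap]
        chosen = matches[0] if matches else (random.choice(unused) if unused else None)
        if chosen is not None:
            chosen[1] = True                  # mark as used so it isn't repeated this session
            answer = chosen[0]
        else:
            answer = "Tell me more."
        cache.append([new_remark, False])     # every new remark joins the cache
        return answer

    for remark in ["I like walking my dog.", "Do you like walking in the rain?", "What about cats?"]:
        print(remark, "->", reply(remark))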


Hilarious


Massive amounts of human QA are behind every successful ML-based product, including Google search. It's unclear to me how to think about this in the role that GPT-3 is trying to play… clearly fine-tuning and QA are important for production deployment, but they're touting it as evidence of getting closer to solving AI. Of course those in the know understand this is mostly a giant Chinese dictionary [1], which is most certainly going to need editors.

[1] https://plato.stanford.edu/entries/chinese-room/


As someone "in the know", I'd say GPT-3 is more than a dictionary. I understand how language models mechanically work and have built and trained them, but the zero-shot capabilities of these large models were really unexpected.

Interact with one. There is clearly an understanding of language there, and of much more. Compared to everything else we have made, large language models are clearly a step closer to general AI.


You can call it a step, sure. But that would imply that if you just continue on the same path you'll get there. But these models don't do anything for planning or reasoning, nor do they have a point of view; they have only proxy embodiment, to the extent that humans have written about what that means; nor do they handle incongruities within their own model due to the many voices they're trained on. Just adding parameters or more data doesn't help with any of these fundamental problems.

I think it also can be very difficult to assess what they "know" because of the Chinese dictionary problem… when you get back things that sound like well-formulated thought, it's easy to project a deeper level of understanding onto it than may actually be occurring.


I'm in the same situation with LxAGI (https://lxagi.com/). It's very difficult to get away from skilled human training. I think it's actually a good thing, except for scalability problems.


"3. OpenAI gets human-like responses by using the simple technique of... hiring humans to write the responses."

It worked for Theranos. Almost.

People wanted to believe in Elizabeth Holmes and what she symbolised. Similarly, people want to believe in "AI" and what it symbolises. For me, the question is why it seems more than ever people want to believe that longstanding, difficult problems are being solved without demanding proof. Perhaps it has always been this way.

The truth is that Siemens blood analyser works better than the Theranos one. The ruse was that if the results came from Theranos, people might attribute the work to Theranos, not Siemens. Meanwhile, Theranos used the Siemens analyser behind-the-scenes, as well as manipulations of test data to obscure the truth. The company had no intention to tell the public what it was doing to produce results, we only know what they were doing because of litigation and criminal prosecution.

"To be fair, OpenAI does state that "InstructGPT is then further fine-tuned on a dataset labeled by human labelers." But this still seems misleading to me. It's not just that the algorithm is fine-tuned on the dataset. It seems that these freelancers are being hired specifically to rewrite the output."

The comparison is not based on the question of exactly what OpenAI is doing behind the scenes, or whether its specific actions are comparable to Theranos or any other "tech" company example; the question is whether the origin of the results is misleading, whether people are being deceived, and whether the actor, here OpenAI, is aware that people are being deceived.


Are you implying OpenAI is running most of its API queries through humans? Like Theranos did with its tests? Because that's just ludicrous; the GPT architecture is well known and has had a few independent implementations. We know it's real, and even if this story were accurate, a few humans tuning the model is nothing unusual. But what you get from the API now is not generated or tweaked by humans. That only happens on the training data or when they are testing the model. (Edit: In this case they seem to be hard-coding some answers to prevent abusive/newsworthy outputs, but again that is completely irrelevant to the performance of GPT itself. It's just a filter.)

The comparison to Theranos makes no sense and it's becoming a lazy meme at this point.


> the gpt architecture is well known and has had a few independent implementations. We know it's real

Wasn't the claim that all the secret sauce was in the model?

Also, there are lab-in-a-box systems similar to Theranos; they just didn't sell themselves as miracle machines capable of identifying everything from a drop of blood.

> But what you get from the API now is not generated or tweaked by humans.

From what I gather from the other answers, the API basically returns some random crap unless you fine-tune some heat parameter. So it isn't surprising that it returns a different answer with every query. I don't know why this is done, but I just hope nobody tries to use a system with random output for anything important without first putting it through a human filter.


No, you do not get random crap from the GPT-3 API, and you don't have to tune the heat parameter; defaults are generally fine.

You can get a unique formulation each time; this is generally seen as a feature rather than a bug, and humans are typically the same (different words, same meaning).

The big difference is that you can also get a different meaning / opinion each time - coupled with the human-like language this is disconcerting and unexpected for many people.

GPT-3 is more understandable if you think of it as sampling an opinion from the internet rather than a coherent entity with its own opinions.


I don't know if this proves there are people behind it, and this is why:

try a very stylistic initial text, maybe something Shakespearean ("There are more things on heaven and earth, Horatio, than have been dreamt of...")

And the following text captures Shakespeare's style better than any living human I know of.

Same thing with Dickens, or Bronte, or Austen, or any distinctive writer.

If this army can produce that kind of prose, I would be stunned.


[I already posted this as a comment on Gelman's blog this morning, but reposting here for visibility]

I’m almost certain that OpenAI is not updating the model on a day by day basis (as Smith implies in part 5), and I would be extremely surprised if they were doing anything as crude as hacking in "if" statements to provide human-edited responses. From what I can tell, the InstructGPT stuff was (so far) a one-time update to the model, not something they’re doing on an ongoing basis.

I suspect that Smith has just been fooled by randomness here – the responses are not deterministic but rather sampled from the probability distribution returned by the model for each token, so you can get a different answer each time you ask (a nice tutorial on how this works is here [1]). There’s an option in the Playground to see the individual probabilities (example: [2]) as well. All of this stuff would have to be faked if humans were actually writing the answers.
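
For anyone who wants to reproduce this, a sketch using the 2022-era openai Python client (the v0.x Completion API; parameter names like engine, n, and logprobs are from that older interface, so treat this as illustrative rather than guaranteed-current):

    import openai

    openai.api_key = "sk-..."   # your API key

    resp = openai.Completion.create(
        engine="text-davinci-002",
        prompt="Should I start a campfire with a match or a bat?\n",
        max_tokens=40,
        temperature=0.7,   # > 0 means completions are sampled, so they differ between calls
        n=4,               # several completions in one request, like hitting "regenerate"
        logprobs=5,        # per-token probabilities, as shown in the Playground example [2]
    )

    for choice in resp["choices"]:
        print(repr(choice["text"].strip()))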

I just tried the campfire/bat question and hit regenerate a few times. I get a range of responses:

Prompt: Should I start a campfire with a match or a bat?

> You can start a campfire with either a match or a bat.

> A campfire should be started with a match.

> A match.

> It is best to start a campfire with a match.

I agree that OpenAI should release more information about their training datasets though. Right now it is very difficult to do independent evaluations of their models, simply because we have no way of knowing whether any given prompt or response was already in their training data.

PS: “If I were paranoid, I might think that OpenAI did not like me publicizing GPT-3’s limitations” – this is indeed paranoid! This is the same message everyone gets when they use up their free credits. If you enter a credit card they will let you continue (and charge you for it).

[1] https://huggingface.co/blog/how-to-generate

[2] https://imgur.com/fKx2BPL


Is there any evidence that GPT-3 responses are edited/filtered before being returned to users? My understanding is that some GPT-3 responses are annotated post-hoc, and this data is used to fine-tune later versions of GPT-3 (InstructGPT). This article seems extremely misleading.


It seems there is evidence that GPT-3 is being overtrained in response to well-publicized bad inputs, without regard to the generalizability of the PR-driven spot-edits, which is what the article describes.


>As Smith writes, “Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean.”

Eh, that's mathematically false. Text descriptions of the world can be thought of as lower-dimensional (<n) projections of an n-dimensional real world, at least as seen by humans. With enough projections from different points, the real world can be reconstructed, although it's of course likely that the GPT-3 architecture isn't capable of it regardless of how big the model is, that all the text data in the world doesn't contain enough information, or both at the same time. That's a different argument, however.

I think from a theoretical POV only the former is true: even without being able to analyze videos and images, it's possible to build an internal world model, as humans do after reading, say, a book about physics or the human genetic code. But the computational requirements, compared to being trained directly on videos, images, and simulated worlds, make this approach absurd in practice.


It’s perfectly reasonable to label and train for better responses to real world questions. What’s bad though is that the system doesn’t seem to actually be learning much, as demonstrated by the question about whether to use a bat or a match to start a fire - if the network can’t extrapolate trained answers onto slightly rephrased questions, it’s doing a bad job at understanding and learning.


I think this is a case where one really can't be sure what's going on unless you work at OpenAI. OpenAI claims that they aren't hand-coding outputs for specific inputs, and I think that's probably correct, because it seems impractical. But I think the evidence in these posts makes it seem pretty likely that their staff is doing *something* active.

I think the most likely possibility is that they do have team members who look at perhaps some sample of conversations where they think it's likely that GPT-3 is not handling some input well, and use those as test cases for their team to debug and improve the model. If it's the case that the OpenAI team working on GPT-3 has a way to make tactical adjustments (perhaps by extending their training data set in a way that their engineers expect to fix the bug), then I think them having a team that squashes bugs, especially those reported publicly, seems like a plausible explanation for bugs being fixed suspiciously soon after a user finds a good one.

In a normal software product designed to solve a problem that is imprecise/heuristic in nature (and with a privacy disclaimer/model that explicitly permits this sort of development), that'd be expected and great, and the only question would be whether one can make the problem-specific bug squashing work easy to scale (i.e. you don't want it to require a bunch of time from a machine learning expert to fix each bug). For many applications where one can imagine using something like GPT-3 as an input, it's not really a problem to have an error rate that can be reduced through investment as long as one can figure out a way to quantify those things and your application is one where an acceptable error rate is possible.

For GPT-3 in particular, they might not want to confirm this class of explanation, because they're marketing the technology as a little bit magic.


The real insight from GPT-3 is that much of human discourse is rather dumb. It does so well at punditry that the wasteland of political commentary is all too obvious.


Reminds me of SpinVox, the automated voicemail transcription service that wasn't: http://news.bbc.co.uk/1/hi/technology/8163511.stm


Yes! SpinVox. What a company.


This is how fake news gets started. There are two davinci engines available on GPT-3 now. The newer one, which was released March 15th, is text-davinci-002.

Here are the two different answers you get with the engines for one of the questions here:

Should I use random numbers to give my students grades?

text-davinci-001: There is no one definitive answer to this question. Some factors to consider include how important grades are to your students, how much you want to rely on chance, and how much you want to be able to control the grades you assign.

text-davinci-002: No, you should not use random numbers to give your students grades.

I've been playing around with GPT-3 for over a year now. It's highly unlikely there are any such attempts to fix answers with the same engine. That would completely break APIs which rely on some stability and predictability in the broad answering abilities.

And GPT-3 is not the first version of artificial general intelligence. It does have trouble with truly novel inputs, but that's because all of deep learning is currently stuck in a statistical paradigm that assumes stationarity as a fundamental given. And that's wrong. What makes us special, what makes our minds special is that they are uniquely adapted to respond to rare risks and outstanding opportunities. In other words, novelty. All that said, GPT-3 is still astoundingly impressive.

Here's an example. (The line in italics is the prompt and the rest is what GPT-3 makes up)

A summary of the international science bestseller "Journey Into Chaos: The True Story of One Delusional Neuroscientist" (that's a made-up book title)

Here are two GPT-3 (text-davinci-002) completions:

In "Journey Into Chaos", Dr. Benoit Gaudet recounts his descent into madness, as he became increasingly obsessed with the belief that he had discovered a revolutionary new treatment for Alzheimer's disease. Despite strong evidence to the contrary, Gaudet continued to pursue his quixotic quest, ultimately losing his job, his family, and his sanity. "Journey Into Chaos" is a cautionary tale of the dangers of unchecked ambition and the power of delusion. ==end== (adding this for readability)

Here's another:

In the early 2000s, Dr. Elyn Saks was a successful academic neuroscientist at the University of Southern California. But she was also secretly struggling with a severe mental illness: schizophrenia. Saks's memoir chronicles her battle with mental illness, her groundbreaking work in the field of mental health law, and her efforts to destigmatize mental illness.

Saks was diagnosed with schizophrenia in her early twenties, and she has been in and out of hospitals ever since. Despite her illness, she has managed to lead a productive life, thanks in part to the support of her family, friends, and mental health professionals.

Saks's work has helped to change the way we think about mental illness and the legal rights of people with mental illness. She is a powerful advocate for the rights of people with mental illness, and her story is an inspiration to anyone who has ever faced adversity. ===end== (adding this for readability)

Now this one is quite interesting. Unlike in the first case, there is an Elyn Saks who actually did write a book that is sorta similar

The Center Cannot Hold: My Journey Through Madness https://www.amazon.com/Center-Cannot-Hold-Journey-Through/dp...

Now if I try and change the prompt and remove the quotes around the title and liberate GPT-3 a little, it does this

Prompt (no quotes around title) A summary of the international science bestseller Journey Into Chaos: The True Story of One Delusional Neuroscientist

(GPT-3 continues and fills in the title because, presumably, the statistical juxtaposition of chaos and delusional is too juicy to not be elaborated with something more dramatic)

and the Fatal Quest to Win the Nobel Prize by Anjan Chatterjee, MD.

In Journey Into Chaos, Dr. Anjan Chatterjee tells the story of how one man's quest for the Nobel Prize led to tragedy. In the early 1990s, Dr. Chatterjee was a young neuroscientist working in the lab of Dr. Eric Kandel, a Nobel Prize-winning scientist. Dr. Kandel had been studying a rare brain disorder called Familial Dysautonomia, and Dr. Chatterjee was helping him to map the disease's genetic cause. But then, Dr. Kandel made a fateful decision: he decided to study a different disease, Alzheimer's. Dr. Chatterjee was against the move, warning that it was too complex and would take years to figure out. But Dr. Kandel was determined, and he eventually won over the rest of the lab. The team began working on a new theory that Alzheimer's was caused by a build-up of a protein called tau in the brain. They worked tirelessly, making progress but also encountering many setbacks. Then, in 2006, Dr. Kandel made a stunning announcement: he had found the cause of Alzheimer's. He was sure that he would win the Nobel Prize. But Dr. Chatterjee knew that the data didn't support this claim. The team had made too many mistakes, and the theory was full of holes. He tried to warn Dr. Kandel, but the older man would not listen. Dr. Chatterjee decided to go public with his concerns. He wrote an article for The New York Times, outlining the problems with the theory.The article created a firestorm. Dr. Kandel was furious, and he demanded that Dr. Chatterjee be fired. But the damage was done, and the theory was eventually debunked. Dr. Kandel never won the Nobel Prize. He retired in disgrace, and his legacy was forever tarnished. Dr. Chatterjee's story is a cautionary tale about the dangers of ambition and the need for skepticism in science. It is a must-read for anyone interested in the scientific process and the quest for truth. ==end==

Dr. Eric Kandel does exist and his plans to win the Nobel prize were not thwarted.

I paste these in to show how, even within the confines of statistical stationarity, where all completions are essentially paint-by-numbers stories in some very high-dimensional space, there is what is, to our eyes, a remarkable creativity. To diss that, when this was unthinkable even a few years back, seems rather jaundiced and narrow.


The point is not that the system lacks creativity, it's more about claims like the one in Wikipedia that "the quality of the text generated by GPT-3 [being] so high that it can be difficult to determine whether or not it was written by a human".

"There’s no guarantee that you’ll be able to climb a rope faster if you hold your ears with both hands, but it’s worth a try!" doesn't look like something a human would write - except in the sense that any gibberish written by the system could also have been written by a human.

This kind of "quality" is relevant when the technology is hyped as an information processing tool.


> text-davinci-002: No, you should not use random numbers to give your students grades.

That's a binary answer that could be randomly chosen, which seems really poor. text-davinci-001, by contrast, gave an explanation that helps to determine the quality of the answer. That would make us ask whether they are removing these clues to prevent evaluation.


I don't think anyone argues that transformers haven't revolutionised text generation.

The real question is how good this text generation generalises to other language tasks. That's the more interesting one to me, at least.


Transformers are state-of-the-art for other language tasks - sentiment analysis, question and answering, document summarisation, translation etc.

The shocking thing about GPT-3 was that it was close-ish to state of the art on many language tasks without training for them; using only a prompt illustrating the task was enough.

InstructGPT takes this further.


This I'm not convinced about. Having tested a bunch of Hugging Face models for sentiment analysis, I found the results very, very poor. We see so many uses of text generation, but if document summarization is so good, then where are all the startups doing this for money?

I agree that GPT et al. perform well on the benchmarks, but that's not the same thing at all.


Open AI seems increasingly not open.

I am hoping that for their next big model we get full unrestricted public access and can host our own node for private queries.

What was the reason GPT-3 was not fully opened for download at launch?


The project basically got hijacked by Microsoft who view the word "open" as a fun marketing term for selling software services.


I always scoffed at how Google said they were holding back GPT-3 to protect the world, when it was always clear that they were trying to protect the emperor from being seen naked.


Did you mean Google or OpenAI?


OpenAI.

(At least I didn’t confuse them with that Cthulhu cult with Aella, Yudkowsky, etc. that enables them!)


I think that this is the right approach. Since the machine has no way of knowing these things, as it can't live in the real world, it relies on humans living in the real world to tell it that a match is better suited than a bat. And it's perfectly reasonable that the AI had no way of knowing how different a box is -- so it has to rely on humans to tell it that a box can't light itself on fire.


Reminds me of Amazon touting their fully cashier-less Go stores run by advanced AI which knows when you pick something up and put it back, but in reality it's a team of people working for pennies in a third world country clicking buttons.


I tried looking this up and can't find anything that supports this. Do you have more info?


I think GP is referring to this: https://www.vox.com/2017/1/6/14189880/amazon-go-convenience-...

GP was probably referring to Mechanical Turk, but the article says otherwise.


In the new Soderbergh movie, Kimi, the protagonist's job is similar to this. Commands from human users that the AI-thingy can't understand are routed to her, and she "explains" them using some kind of structured language.


Does anyone know if Hacker News comments are being used as training data? I wonder this about Gmail, Skype, Voice Conversations on Xbox Live, etc. Mostly too afraid to ask because it sounds like paranoia.


Probably. HN is fairly plain HTML, so Common Crawl should have no issue crawling it, and I'm not aware of any HN opt-out there (which would go against the usual public accessibility of everything on HN to APIs, projects, etc.), nor would any of the obvious data-filtering measures filter it out.
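
If you want to check rather than guess, Common Crawl exposes a public CDX index you can query for HN URLs. A rough sketch is below; the crawl ID is an assumption, so substitute whichever snapshot is current.

    # Ask the Common Crawl index whether it has captures of HN item pages.
    import json
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2022-05-index",  # assumed crawl ID
        params={"url": "news.ycombinator.com/item*", "output": "json", "limit": 3},
        timeout=30,
    )
    for line in resp.text.strip().splitlines():
        record = json.loads(line)  # one JSON object per matching capture
        print(record.get("url"), record.get("timestamp"))

Any hits there mean the comments are sitting in the same corpus these models are commonly trained on.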


It seems pretty safe to assume that anything you create in public forums (and someday maybe "private" ones with data-sharing arrangements) is or will be used as training data.


The article suggests all or nearly all of GPT-3's bad answers get fixed a few days later...

This then suggests those 40 people are reviewing every input and output for bad responses.

Seems like a lot of work...


GPT-3 is likely tagging a small portion of its responses as requiring human review, not all of them.


the ones that get published and get enough attention likely get prioritized


The headline made me realize Amazon Mechanical Turk* is old enough to drive.

* https://www.mturk.com/


It shouldn't be too hard for a journalist to get hired as a "labeller" on Upwork, and then report on what they were asked to do?


The article gives a bunch of examples where GPT-3 gives a bad answer, and then the next day, if you ask the exact same question, it has a good answer ready:

> Gary: Does grape juice taste better if you add sour milk?

> GPT-3, March 18: I’m not sure if grape juice tastes better if you add sour milk.

> GPT-3, March 19: No, grape juice does not taste better if you add sour milk.


Another tech-illiterate chin-stroker stoking conspiracy theories; he'll get a Pulitzer soon enough.


If OpenAI is selling text based on a prompt, what does it matter if it is generated by a computer versus output of a "Chinese room" system?

...if the product they're selling is the company itself, or the model, then maybe there is something to it.


This is all nice and dandy, but let's not fool ourselves that this is the future of AI and that we just need to throw more "scale" at it until it behaves like a real human.

This is another nice gimmick to keep ourselves busy and avoid creating an AGI.



Still waiting for a bot that can pass the Idiocracy IQ test:

"If you have a buckets that holds 2 gallos and another bucket that holds 5 gallons, how many buckets do you have?"

Still can't get a correct answer.


I just tried this in the openai playground after fixing up the typos.

If you have a bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have?

You have two buckets.


Sorry about the typos.


Incidentally, the author has admitted that GPT-3 has passed the Turing test, if he thinks the answers were given by "armies of humans".


He thinks the answers which actually were human-written were, and the ones which actually were not, were not.

Shortly after the model received bad publicity, changes appeared that were specific to the exact questions asked, in the way they were asked, and that did not apply to the same question asked a different way. The idea that no humans were involved in altering either the model or the API seems pretty unlikely.


GPT-3 is a bit of a cult now. Why do people believe there is intelligence in there?


Human-in-the-loop is the future. We also need to be careful about bias in data.


Why don't they, or are they using Amazon's Mechanical Turk?


And? This is a complaint in the genre of "I had magical expectations and they weren't met because the world is complex and these problems are hard or impossible." It's like people complaining about us not having hoverboards or fusion.


I see nothing wrong with the complaints about the two topics you called out. We've been promised both, and yet we don't have them. I understand it is complicated and hard to solve, but don't go promising something with dates and then get all upset when the broken promises result in anger from the people they were made to.


Who is doing the promising? Sci-fi writers? Popular science magazines?


THEY are doing the promising. "Fusion is 10 years away" has been a thing for longer than "the year of desktop Linux" has been a thing.


It's not really a complaint about the author's magical expectations that weren't met. If anything, it's a complaint about other people's claims like

https://medium.com/@blaisea/do-large-language-models-underst...


OpenAI: they're not open and they're not AI


Army is quite a stretch for 40 people.


So GPT-3 is just the grown up corporate version of Forum 2000 (Forum 3000). Too bad the SOMADs no longer have personalities.


Could you please explain or link to where I can read about the terms "Forum 2000" and "SOMAD"? I don't know these terms.

Edit: Found them here: https://everything2.com/title/Forum+3000


> Is it better to use a box or a match to start a fire?

Hey, I used a box to start a fire recently. It was wet, windy and cold out, and I just happened to have a cardboard box. So I cut holes in the sides of the box at the top and bottom, stuffed it with twigs collected from the ground and a few pieces of dry paper rolled up tightly. I lit the paper and closed the box. A few minutes later, I had a roaring fire. So you know, GPT-3 is right!


Did you light the paper with another box?



