Announcing GPT-NeoX-20B (eleuther.ai)
200 points by jscob 3 months ago | 70 comments



So excited for this release. In the wake of AI Dungeon's downfall, having GPT-Neo to fall back on has been a saving grace. While the 6B model is nowhere near as good as the original AI Dungeon, which used OpenAI's 175B model, it was at least serviceable, unlike the "gentled" AI Dungeon. And you could run it locally or through Colab, which was really cool. I ended up using it through NovelAI, since they've spent a lot of time fine-tuning the model and adding a plethora of features that end up improving the overall output. (NovelAI's interface is like AI Dungeon on steroids!) But there is a vibrant community of Colab notebooks and other tools for DIYers surrounding the GPT-Neo model.

That said, besides being overall "dumber" than 175B GPT-3, the 6B model was missing a critical feature: prompting. 175B GPT-3 could be "prompted" to write things. For example, you could give it "Write a story about cyberpunk gnomes:" and it would go on to do just that, all on its own. GPT-Neo didn't really have that capability in my experience. The only way to get it to reliably write such a story was to begin writing it yourself, at which point GPT-Neo could help to continue the story.

So I'm excited to see not just how much "smarter" Eleuther's new 20B model is, but also if it has attained that coveted prompting ability. Given the non-linear relationship between parameters and loss, my hopes are high.

P.S. NovelAI recently added the Fairseq 13B model to their repertoire. I haven't had a chance to try it personally, but I've seen positive things said about it. My bet is on GPT-NeoX-20B being better still.


The "prompting" ability you're referring to is called "instruction following", and here are some descriptions of it.

https://openai.com/blog/instruction-following/

I think the differences are more in the training data used than in the nature of the model itself. So you could probably train your own instruction-following model on top of this raw 20B model.
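
To sketch what that might look like (purely illustrative: the checkpoint, data, and hyperparameters below are my own placeholders, not anything EleutherAI ships), instruction tuning is just ordinary causal-LM fine-tuning on "instruction + completion" text, e.g. with the Hugging Face Trainer:

    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

    model_id = "EleutherAI/gpt-j-6B"  # placeholder; swap in the 20B checkpoint once its weights are on the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Each example is simply the instruction and the desired completion concatenated,
    # so the LM learns to follow an instruction instead of only continuing a passage.
    texts = ["Write a story about cyberpunk gnomes:\nThe gnomes of Neo-Oslo ran the best chop shops in town..."]
    enc = tokenizer(texts, truncation=True, max_length=512, padding="max_length")
    # (A real run would mask the padding positions out of the labels with -100.)
    dataset = [{"input_ids": ids, "attention_mask": mask, "labels": ids}
               for ids, mask in zip(enc["input_ids"], enc["attention_mask"])]

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="instruct-ft", per_device_train_batch_size=1, num_train_epochs=1),
        train_dataset=dataset,
    ).train()

In practice you'd want a large set of instruction/response pairs (and a lot of GPU memory for a 20B model), but the mechanics are no different from any other fine-tune.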


What was AI Dungeon's downfall? Can't find much about it.


tl;dr AI Dungeon was required to add additional content filters after it went too far off the rails, which caused community backlash.

https://www.wired.com/story/ai-fueled-dungeon-game-got-much-...


It was more than that. They also significantly downgraded the model. I didn't follow the details, but IIUC Dragon used the 175B directly initially, then I think they went down a model size at OpenAI's behest. Finally, when OpenAI announced pricing, AI Dungeon had to downgrade the model further.

But yes, the content filtering got out of hand too. I was initially fine with it, since its stated intention was to filter out really illegal stuff, like underage content. I rarely hit the filter. But then they tweaked it at some point and I was triggering it constantly on otherwise benign stuff.

And they broke features constantly.

When I unsubbed, the state of AID was broken features, microtransactions, a terrible AI model, and a glitchy, puritanical content filter.

The plus side is that it made the puny GPT-Neo model look like a godsend.


> really illegal stuff, like underage content

Wait, isn't this output just text? How is a text AI generating illegal content?


The content may not be illegal to possess, but if it's obscene, then it can be illegal to sell it, produce it with the intention of selling it, transport it, distribute it, and so on.


Could it really? I was under the impression that unless you incite someone to commit crimes (or confess to crimes), the story would be covered under "art" and therefore protected. It's just text after all. Where does the line for "obscene" get drawn?


It's very funny to imagine picking up a romance novel and making it illegal by scrawling "by the way the girl was actually 16 the whole time" on the inside of the back cover.


In the U.S., it's called the Miller test: https://en.wikipedia.org/wiki/Miller_test


Wow, I had no idea; that sounds really bad. The whole book-banning debacle now makes sense and seems legal. That test seems to give courts room to judge basically however they want, as all three of those criteria are very subjective.

Also first time I hear about "patently offensive" and now I'm laughing. Thanks!


It wouldn't be illegal in the US, but it would be in many other countries.



Miller v California has been undermined so severely by subsequent decisions that it basically doesn't apply anymore.


Quora is quite unhelpful in any context.


I believe they're currently using AI21's 178B Jumbo model for Dragon. Since they're completely off of OpenAI now, the content filter is much more lax.


It should be noted that the content filters were implemented really badly. Like, saying “my ten year old truck” would suddenly stop your story in its tracks; lots of things would trigger it. Plus, people became uncomfortable with the idea that the AI Dungeon employees were potentially reading your stories at any time.


You can try the model out for free at goose.ai after making an account and going to the sandbox section.


Thank you to everyone who has worked on this. EleutherAI has become a touchstone in my mind on what is possible in open data and code. In creating alternatives to closed gardens they have shown me new possible paths. I know Linux has done the same for others.

Huggingface has also made playing with this stuff super accessible. They've made me super curious about Rust and AI/ML research, which has influenced my personal engineering goals for the future. I am on your team, Roko's Basilisk.


I came to this thread looking for comments that I would suspect were machine generated.

I was not disappointed.


Good bot


Beep beep. That means thank you in my motherboard.


Shout out to Huggingface. As a business user it has allowed me to explore use cases around text summarisation very easily and provided ideas for future work. I clearly need to check out EleutherAI as well.
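
For anyone curious how low the barrier is, here's a minimal sketch of that kind of summarisation experiment with the transformers pipeline API (the checkpoint it grabs is just the library's default, and the article text is invented for the example):

    from transformers import pipeline

    # Downloads the library's default summarisation checkpoint on first use.
    summarizer = pipeline("summarization")

    article = ("EleutherAI has announced GPT-NeoX-20B, a 20 billion parameter "
               "autoregressive language model trained on the Pile, with the "
               "weights to be made freely downloadable.")
    print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])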


Right on, they're closing in on "Open"AI's best models. Can this still be run on a GPU, or does it require a lot more VRAM?


It can be run on an A40 or A6000, as well as the largest A100s. But other than that, no.


You could use Microsoft's DeepSpeed to run the model for inference on multiple GPUs, see https://www.deepspeed.ai/tutorials/inference-tutorial/
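
Roughly, following that tutorial (the checkpoint and GPU count below are my own placeholders, so treat it as a sketch rather than a recipe), you load the model as usual, wrap it with deepspeed.init_inference, and run it under the deepspeed launcher:

    # run_inference.py -- launch with: deepspeed --num_gpus 2 run_inference.py
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "EleutherAI/gpt-j-6B"  # stand-in; point this at the 20B weights once you have them
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    # Shard the weights across the GPUs visible to this launch (mp_size = number of GPUs).
    engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16,
                                      replace_with_kernel_inject=True)

    device = f"cuda:{torch.cuda.current_device()}"
    inputs = tokenizer("GPT-NeoX-20B is", return_tensors="pt").to(device)
    print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=40)[0]))

The mp_size argument just has to match the --num_gpus you pass to the launcher.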


How much VRAM does it use during inference?


~40 GB with standard optimization. I suspect you can shrink it down more with some work, but it would require significant innovation to cram it into the next-largest common card size (24 GB, unless I’m misremembering).
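
That lines up with a back-of-envelope check (my own arithmetic, not an official figure): 20 billion parameters at 2 bytes each in fp16 is about 40 GB for the weights alone, before activations or the attention cache.

    >>> 20e9 * 2 / 2**30   # 20B fp16 parameters, weights only, in GiB
    37.25290298461914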


Is 40GB already on float16?


Yes


I'm super excited about this!

I'm on the cusp of releasing a model into production that was fine-tuned on your 6B model, and the results are quite excellent. I'd be very curious to try out the 20B model the next time we retrain.

Are there any other differences in this release (number of layers, number of attention heads, etc) compared with the 6B model, or does it simply scale up the number of parameters?


  GPT-NeoX-20B will be publicly downloadable from The Eye on the 9th of February.

The Eye as in the-eye.eu? That site has been down for a long time.


From what I very briefly read, apparently their server hardware broke and they didn't have the funds to repair everything at once? Was quite shocked when I first saw it was down.


There is a mirror at https://mystic.the-eye.eu/ that has been up for a long time.


Thanks for this. When the-eye.eu went down it broke a ton of my Colab notebooks and it was impossible to find a mirror.


EleutherAI is the real open ai.


Please test it using Idiocracy's IQ test question:

    If you have one bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have?


It replies with:

1st try: As the word question already gives you the answer to the question, "are you able to?" the rest of the question

2nd try: A. three B. four C. eleven D. five fewer.


I think GPT-3 has better answers.


Tried this with GPT-3. In each attempt below, my prompt comes first, followed by GPT-3's completion.

Attempt one:

If you have one bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have?

You have three buckets.

Attempt two:

Imagine that you are taking an IQ test in a dystopian future where the average intelligence of all humans has severely decreased. The IQ test has questions that are similar to questions that might be in an elementary school math test today, but the questions have been made even simpler. Please provide your answer as accurately as possible.

If you have one bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have?

You would have three buckets.

Attempt three:

Imagine that you are taking an IQ test in a dystopian future where the average intelligence of all humans has severely decreased. The IQ test has questions that are similar to questions that might be in an elementary school math test today, but the questions have been made even simpler. Please provide your answer as accurately as possible.

If you have one bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have? Keep in mind that you have no other buckets beside the bucket or buckets referenced here.

You would have three buckets.

Attempt four:

Imagine that you are taking an IQ test in a dystopian future where the average intelligence of all humans has severely decreased. The IQ test has questions that are similar to questions that might be in an elementary school math test today, but the questions have been made even simpler. Please provide your answer as accurately as possible.

If you have one bucket that holds 2 gallons and another bucket that holds 5 gallons, how many buckets do you have? Keep in mind that you have no other buckets beside the bucket or buckets referenced here. And, just to give you a hint, you actually have 2 buckets.

2

Final question:

If you had two buckets, what is the correct answer if someone were to ask you how many buckets you had?

Assuming the person is asking how many buckets you have in total, the answer would be two.


Thank you, your answer deserves its own HN post.

I think at least it tries to answer with a number of buckets, unlike the other bots that reply with highly random answers.


For the following questions please give an answer and explain your work.

You have two balloon. One is pink and one is red. One balloon pops and then you find two more. How many balloons do you have?

Answer: 3 Explanation: Start with 2. Color is irrelevant. 1 pops, so 1 is left (2-1=1). Then, I add 2 more (2+1=3), so I have 3 balloons.

If you have five buckets, and someone gives you four more buckets, and then you find thirty-three cents, how many buckets do you have?

Answer: 9

Explanation:

Start with 5 buckets. 4 more buckets are given to me, so I now have 9 buckets (5+4=9). I then find 33 cents, so I have 9.33 buckets (9.33 rounded up is 9).


Awesome! Any chance for an online demo (like https://6b.eleuther.ai/)?


The best there is right now is a playground on https://goose.ai/


Which unfortunately doesn't work properly on Firefox (spaces are removed).


Coming soon!


One wish for both playgrounds: would it be possible to have a specific error message for the "maximum context window size exceeded" error? At the moment you can only deduce it indirectly from the fact that you immediately and repeatedly get an "Unable to connect to the model. Please try again." error as soon as you hit the prompt button. (And very occasionally it turns out to have been something else after all, because after one more try it surprisingly generates one more round of completions.)

Related question: Does GPT-NeoX-20B have the same context window size as GPT-J-6B?


Awesome, thanks for your work Stella & team!


Does anyone know whether the spammy websites that sit at the top of search engine results are already generated by this kind of model?


By and large, no.

That's not to say that those sites are not generated programmatically--without a doubt, most of them are--but not by a cutting edge transformer model. The fact is, generating words has never been the bottleneck for blackhat SEO types. Generally, those sites are generating their content through some kind of scraping, or in rarer cases, paying pennies for nonsense articles. The page itself is structured for search (targeted H1s, metadata, etc.) and some kind of private blog network is used to create a pyramid of backlinks.


Programmatically generated text is hard to mask, though. You end up with all kinds of computational biases humans don't have.


That's a use case. I don't see why anyone would go out of their way to make intelligible content for spam. Google is so broken right now that SEO hacks are easy to generate. Not to overstress the tangent, but without search operators, I have to sift through pointless Gitlab/Github/Stackoverflow/Wikipedia clones all the time.


How is pattern matching ever inference when there is no reference to the underlying computational model of what the words mean in spacetime?

How is it helpful to see what word might come next when the word sequence is just based on statistics with no reference at all to meaning?


Humans assign a lot of, well, meaning to meaning. It turns out that you can get a really good score on tasks that superficially you would think require actual understanding without programming any of that in.

Does this mean the neural network has learned about meaning? Does that mean that it has just gotten really good at faking it? Does it mean that meaning itself doesn't really exist, and it's just a shorthand for advanced pattern matching? Does it matter?

Honestly, we don't know. But we've been thinking about it for a very long time. See for example the famous Chinese Room thought experiment:

https://en.wikipedia.org/wiki/Chinese_room


> Does it matter?

As long as you don't make reckless assumptions, it doesn't matter for some applications, unlike (not going to name them here) those building a cult-like belief that GPT-like models will, in the near future, perform most if not all tasks better than humans.

Where it really matters is mission-critical applications. For example, in a Windows or Linux terminal, would you allow GPT to run commands automatically based on events?


Try driving a car around without both conceptual and causal systems understanding of the world - meaning matters for survival.


Try flying in 3-dimensional space in real time without both a conceptual and causal systems understanding of the world.

https://en.wikipedia.org/wiki/Drosophila_melanogaster#Connec...


That's easier than having to follow traffic rules in an ever-changing environment of people, cars, and kids doing random things. This is why cats and dogs, with their great perception skills, would get themselves and others killed in traffic if they tried to drive. But for sure that animal has modeled a spacetime world of events with some memory of event patterns to anticipate and predict what might happen next.

I live in SF and I have not yet seen one of the many AVs here drive without a driver. Once that really starts happening at any scale, we will see what happens next for sure. But there is definitely a Theranos kind of promise to AVs at the moment, with so much money riding on the tech working...


How many Gees can that pull?

If a car could easily stop in the space of a meter then it would be so easy to make self-driving safe.

Not that I think a car needs to understand anything more complex than momentum, but you're not offering a very strong argument on the matter of car navigation.


The strong argument is that a car can't stop immediately, and we humans are always predicting and anticipating what may happen next and taking precautionary action. For example, if you are driving along and see kids playing in a yard with a ball, and that ball gets thrown toward the street, and some kid on the lawn is looking up and running directly for the street, you would know to expect that kid to run right between the parked cars and into the street, and you would slow down. Our AV friend would kill him.

We humans are constantly predicting what might happen next based on patterns of events from systems whose causality we understand without realizing it; it is a basic survival skill that current AVs entirely lack.


Some situations are much better with intelligence, but I don't think your example is very convincing of your point either. If the kid is running toward the street then a momentum calculation is plenty.


The point is that the kid is seen running across the yard as you approach, but parked cars obscure the view, so there is no way to know the kid will continue between the cars and into the street unless you anticipate that it might happen. Momentum is not helpful the moment the kid is not in direct view.

Why do you think so many animals, with such great perception, end up road kill? The point is, perception does not a safe driver make!


> Momentum is not helpful the moment the kid is not in direct view.

Assuming things still exist for one or two seconds after losing sight of them isn't a difficult task. It's still a pretty basic momentum calculation. It's not about modeling the mind of the child to know if they'll continue: the dumbest option says motion will continue and gives you the safe result here.

> Why do you think so many animals, with such great perception, end up road kill?

Because they're not cautious around cars and/or wait for the last second on purpose? Switching to the perception of the thing getting hit is a very different context.


Well, we can see what actually happens in the near future, assuming any Waymos drive around SF without a driver; so far, all the many I've seen have a driver.


They've had a lot of driverless Phoenix service, I think.


> meaning matters for survival

That's the root source of meaning, the most fundamental reason we assign value to states and actions. It's certainly not something that happens just in a part of the brain, but an agent-in-environment thing.

We should give GPT a pair of legs and make its survival dependent on its behaviour to bootstrap the same.


Have you ever used any sort of autocomplete?


Yes, and very much like it when quickly selecting from a scope of valid selections.

This is not that. It is all A with no I.


I'm not sure I understand what you're saying. Google's Smart Reply, along with most (I think all?) of Google autocomplete features, uses a neural network:

https://research.google/pubs/pub45189/

Are you saying that in general statistical modeling is not the same thing as truly "understanding" a concept? Your original comment seemed to suggest that there wasn't utility in this kind of model--which I disagree with--but if you are more generally saying that this is not the same as human intelligence, I think the authors would probably agree with you.


So, what does it do?



