Hello Dolly: Democratizing the magic of ChatGPT with open models (databricks.com)
494 points by hnuser0000 11 months ago | 173 comments

I might be having a moment - but I can't find any links to a git repo, huggingface, or anything about the models/weights/checkpoints directly from the article.

I just see a zip download that AFAIK also doesn't contain the weights/checkpoints. I find this a bit odd, the contents of the zip (from the gdrive preview) look like they should be in a git repo, and I assume they download the model from somewhere? GDrive usually has rate limits which I'm concerned about.

If anyone from databricks reads this - are there plans to publish this on a git repo somewhere, as well as the weights/checkpoints?

EDIT: Oh I just noticed

> Contact us at hello-dolly@databricks.com if you would like to get access to the trained weights.

This... seems odd for an article titled "Democratizing the magic of ChatGPT with open models"?

Full source code is up here now:


Sorry it took us a day to get the external repo setup.

Awesome thank you!

Was the Alpaca dataset being licensed as non-commercial only the reason you aren't releasing the weights? Is it possible to just release them under the same license?

Yes the issue is that some of the training data is arguably tainted with some noncommercial license (it's nuanced, discussed below in my comment). We are releasing weights to people who request but we just wanted to have an email request flow so that we can make sure people know it's just for noncommercial purposes.

Working on a model without this issue. Certainly our goal is totally open models anyone can use for anything.

Understandable, thank you for the response!

I've been a bit jaded by the "open/democratizing ai" stuff and then having companies stiff us at actually making it open - but not wanting to be the first to litigate these new types of issues ml brings is very understandable.

Question: would you consider benchmarking a single 4090 for your training? While training in a few hours with 8x A100s is impressive, I think others and I are curious how that translates to consumer hardware. IMO running/fine-tuning on consumer hardware is the ultimate endgame for all AI models.

Look forward to a response. We are heading toward a 6X Bizon 4090 system as a test bed.


The README also says this:

> This fine-tunes the [GPT-J 6B](https://huggingface.co/EleutherAI/gpt-j-6B) model on the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset using a Databricks notebook.

> Please note that while GPT-J 6B is Apache 2.0 licensed, the Alpaca dataset is licensed under Creative Commons NonCommercial (CC BY-NC 4.0).

...so, this cannot be used for commercial purposes

Essentially every model worth anything has been trained on an unfathomably large amount of copyrighted data, with every possible licensing scheme you could imagine, under the assumption that it is fair use. While you can argue that it's all built on a house of cards (and a court may well agree with you), it's kind of arbitrary to draw a line here.

> under the assumption that it is fair use.

No, because you as a human looking at "art" over your lifetime and learning from it is not "fair use" of the copyright; it's no use at all. This is the crux of every argument both for language models and AI art models: that their tools are learning how to draw, learning which styles and characteristics of input art correspond most with words, and creating art with that knowledge just like any other human, not simply collaging together different pieces of art.

Fair use via "this is completely impossible to regulate so you might as well embrace it"

> ...so, this cannot be used for commercial purposes.

The legal relation between models and training data sets seems murky. Of course, with the build tooling, you can also substitute in another instruction-following training set if you want to avoid licensing issues with the Alpaca set; if you aren't concerned with them, you can just blaze ahead.

> ...so, this cannot be used for commercial purposes

The implication being that you're only "democratizing" something if people can make money off of it?

Kinda? I (personally) read "democratizing" as intending to be for the benefit of many over the few. Bit duplicitous to preclude access to the means of production in that definition, "for many rather than the few (BUT, wait, the actual economic benefit and utility is still locked away for the few)".

But maybe "democratize" is starting to mean something similar to "open". All the good words.

> ...so, this cannot be used for commercial purposes

or you can raise $30,000,000 right now and worry about the copyright infringement lawsuit in 2026 or never.

> ...so, this cannot be used for commercial purposes

Can't they also release the fine-tuned weights as non-commercial as well?

As far as I know the copyright situation for models is ambiguous and also depends on the region. In the US you can't copyright data made by an automated process but you can in the EU, or something to that effect.

Lol. This is classic ML crap. Files with no documentation, no links, multiple files with the same-ish name but no explanation for which one is what.

Yes, the ZIP on Google Drive owned by one of their engineers is weird considering they have a pretty active GitHub presence of open source projects, though it does use an Apache license like their others.

Perhaps Databricks suspected another big announcement coming soon and wanted to get this announcement out?

Are they pulling a Facebook, on model access?

I think they are dodging unclear legal issues surrounding certain steps of the model-building process while being as open as possible with the components given that constraint, allowing downstream users to make their own legal risk vs. effort choices.

Given the hardware/energy needed to train, it would be nice to have a legal document that said something like: this model has no warranty; it may be a breakthrough machine or a hand grenade. Use at your own risk!

Yes, this.

From what I can tell they're fine-tuning EleutherAI's GPT-J.

Alpaca was made to fine-tune LLaMa, however they also released their dataset they used to do this, and it looks like Dolly is this dataset applied to GPT-J, and does not use LLaMa itself.
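For reference, applying the Alpaca dataset to GPT-J just means rendering each record (fields `instruction`, `input`, `output` in the public Alpaca dataset) into a single training string. A minimal sketch of the commonly used template (exact preamble wording is illustrative):

```python
# Sketch of the Alpaca-style prompt template used for instruction tuning.
# Field names ("instruction", "input", "output") follow the public Alpaca
# dataset; the preamble wording here is illustrative.

def format_alpaca_example(example: dict) -> str:
    """Render one Alpaca record into a single training string."""
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

text = format_alpaca_example(
    {"instruction": "Name a sheep famous for cloning.", "input": "", "output": "Dolly"}
)
```

The fine-tuning run then trains the base model on these rendered strings with an ordinary causal language modeling objective.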

Thanks I missed that email while skimming

So it’s another classic private only model that they’ll pull as soon as the suckers have trained it up for them

Databricks is on a roll

Does anyone else find it ironic that all these ChatGPT "clones" are popping up when OpenAI is supposed to be the one open sourcing and sharing their work?

I guess: "You Either Die A Hero, Or You Live Long Enough To See Yourself Become The Villain"?

> when OpenAI is supposed to be the one open sourcing and sharing their work?

OpenAI renounced being open source. Don't let the name fool you.

I think all of the "AI alignment" talk is mostly fearmongering. It's a cunning way to get ignorant people scared enough of AI that they have no choice but to trust the OpenAI overlords when they say AI needs to be closed. Then OpenAI gets a free pass to be the gatekeeper of the model, and people stop questioning the fact that they went from open to closed.

AI being tuned to be "safe" by an exceedingly small set of humans is the thing we should be afraid of. It's the effective altruism effect: if you bombard people enough with "safety" and "alignment" speak, they will look past the fact that you're mainly interested in being a monopoly. My bigger conspiracy theory is that Bill Gates getting behind "AI alignment" is a calculated move to get people to look past Microsoft's unilateral involvement.

I don't know what press releases you've been reading, but the model is closed so they can make money off it, that's pretty obvious.

I think that is a simple take and underestimates the insidious nature of the AI alignment initiatives. Or maybe I'm overestimating it.

At this point I'm really not sure what they're up to in terms of grand strategy. I don't even know that making money is their ultimate goal. At a certain level of ambition money is just a tool to get what you really want.

It's interesting to note that Altman has no equity in the company. One of the primary motives espoused for becoming a for-profit company was to be competitive with big tech as far as bringing in top-level research talent.

I don't think that Altman's lack of equity position in OpenAI means anything at all when it comes to what OpenAI's goals are.

We know what their immediate goals are: to make as much money as possible. The only question is what their longer-term goals are.

Seems like it would make sense for that to be the real reason, and the safety concerns to be a convenient scapegoat, although from talking with several people who work at OpenAI, they really do seem to believe the safety/alignment issue deep in their bones. I could almost be led to believe that the massive business advantage of keeping it closed is a happy side effect for them and not the actual reason.

> they really do seem to believe the safety/alignment issue deep in their bones

The employees are usually "in on the message". All you have to do is get your employees to believe the message. This is easy when the company is growing exponentially. The CEO's word is gospel.

Sam Altman has turned into a megalomaniac.

Possibly, but it is a bit unusual that he has zero equity in the company. So it might not be for monetary reasons.

I still haven't fully ruled out that his consciousness has been replaced by an AGI

AI and high-performance semiconductors are the only technological fields where the US and allies haven't been surpassed by Russia and China.

There is probably a lot of political pressure on OpenAI to be as closed as possible. Remember the US government has banned Nvidia from exporting A100/H100 to China/Russia. Those are the same chips OpenAI uses for both training and inference.

Anyone in China/Russia who can comment on the actual situation? How difficult is it to train/run AI models where you are living?

Russia is simply importing A100s through shell companies in the UAE.

In which fields have Russia surpassed the US? I get China, but Russia?

> Surprisingly, instruction-following does not seem to require the latest or largest models: our model is only 6 billion parameters, compared to 175 billion for GPT-3.

We started seeing this in our testing. OpenAI's Curie model is responding very well to our fine-tuning experiments for a chatbot-style interface. I am trying to keep us focused on the quality of training data rather than obsessing over raw network size. Davinci (and derivatives) might turn out to be overkill for our use cases.
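For context on what such fine-tuning experiments involve: OpenAI's legacy fine-tuning flow for Curie-era models takes JSONL records with `prompt` and `completion` keys. A minimal sketch (the Q&A pairs and the separator/stop conventions are illustrative, not from the comment):

```python
import json

# Hypothetical Q&A pairs for a chatbot-style fine-tune. The legacy OpenAI
# fine-tuning endpoint (the one Curie-era models used) expects JSONL records
# with "prompt" and "completion" keys; the "###" separator and " END" stop
# sequence are common conventions, used here for illustration.

qa_pairs = [
    ("What is Dolly?", "An instruction-tuned GPT-J 6B model."),
    ("Who released it?", "Databricks."),
]

def to_finetune_records(pairs):
    return [
        {"prompt": q + "\n\n###\n\n", "completion": " " + a + " END"}
        for q, a in pairs
    ]

records = to_finetune_records(qa_pairs)
jsonl = "\n".join(json.dumps(r) for r in records)
```

The resulting file would then be uploaded to the fine-tuning API; the point in the comment stands either way, since data quality matters more than which base size you pick.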

Interesting. DALL-E, Dalai (https://cocktailpeanut.github.io/dalai/), and now Dolly are all pronounced the same way.

It feels like there should be an xkcd for this.

AFAIK DALL-E is pronounced as Dalí, as in Salvador Dalí.


It's quite clearly a reference to WALL-E the environmentally conscious robot, which is pronounced as you'd expect. I like to think of it as DALL-E the surrealist robot painter.

I totally failed to make that connection! Was that the intended reference? What's the link to WALL-E?

WALL-E is a robot.

That is exactly my interpretation. Both wall-e and Dalí. I think we are in agreement.

Handy also to think of WALL-E. At least that's where my assumption came from.

I figured it was a reference to the Dalai Lama (which doesn't invalidate your comment, since that's also pronounced like Dalí). LLM -> Llama -> Dalai Lama

I thought "Dalai", pronounced "Dall Eye", rhymes with "Shall I", and "Dali", pronounced "Dahl eee", rhymes with "Carly".

Interesting. According to Google, it's a British ("Da-lie") vs. American ("Da-lee") difference.

This is all very weird to me because I've always pronounced Dalai as "Dah-lay".

Dalí has an accent at the end, which has the emphasis in the last letter. Dalai does not. They sound very different. “dah-lee” vs https://m.youtube.com/watch?v=JhFbvuKn45w

Hmm. Is Salvador Dalí pronounced differently than Dolly or Dalai? The wikipedia page has "dah-lee" as the phonetic, and https://www.google.com/search?q=pronounce+salvador+dali sounds the same as https://www.google.com/search?q=pronounce+dalai+lama. So it seems like all three are identical.

The emphasis in Dalí is on the second syllable, which is at least different from Dolly. I've always pronounced Dalai Lama the same as I would Dolly Lama, but Cambridge dictionary is saying it should be Da-lay in both US and UK pronunciations.

Tangentially, it seems like most of the results for both searches were autogenerated with TTS programs. I wonder if our pronunciations will shift towards TTS mistakes over time. Probably not, these videos only have a few thousand views, but neat if true.

Dalí has the stress on the last syllable, hence the accent (but Dall-e probably not). In my native language Dalai is pronounced "Da-lie", like another comment says above. TIL Dolly is pronounced so similarly. I thought the Do sounded like Doberman, but apparently not.


Wow, just discovered that the American pronunciation for Dalai Lama is Da-lee. Well, that's a discovery.

This is like when Khan Academy came out and there was a guy online saying it's a terrible brand because it sounds like Con Academy which it doesn't in my dialect.

Took a while to get it.

How do you say Khan?

I found this which matches how I say it (as an Indian) https://www.howtopronounce.com/khan/4145893

It's the KH sound that doesn't really exist in English hence many get it wrong.

The KH is one thing, but for "con"-fusion (hah!), it's also about the "higher" "caan" vs "cawn", which is a very subtle difference.

Kãn / k-ä-n, A like in “father”

Are they? (Not sarcastic, I'm not native and I wouldn't pronounce them all that similar at first sight)

As a native speaker, no, there's hardly any consensus I've seen about how to pronounce them. Certainly there are trends. But I pronounce Dalai somewhere between "Dah-lay" and "Dah-lie", and DALL-E sorta like Dolly ("Dah-lee"), but with a deliberate pause ("Dahl Ee").

I guess after carcinisation comes dolly-fication...

But I do like the penchant for whimsical naming schemes in that field. First Sesame Street characters, now apparently everything sheep...

> are all pronounced the same way

No they're not.

What could go wrong

Anyone care to comment on why the output of these models changes so dramatically given so little Q&A training? It's a 6 billion parameter model with only 50 thousand Q&A samples.

It's clear the model already "knows" the format of a Tweet (short length, attention-grabbing, contains hashtags). The model also knows stuff about language models (word2vec, tokenization), and can include entities from the question in its response (Dolly, Databricks). Yet, it just doesn't put these pieces together in the right way without the Q&A training.

Edit: For kicks, I asked GPT-4 this question: https://imgur.com/a/sM4uyBn

Yes this was a very surprising result... that the relatively small uptraining was able to unlock so much latent knowledge in the model.

Just as with Alpaca, the answer lies in the Alpaca instruction dataset, where the focus is on instruction following. GPT-J knew all the answers but has only now been taught to understand the questions thoroughly before writing the answer.

This is really great news and something I felt was missing from the market so far. It seems everyone wants to create `moats` or walled-gardens with some aspect of their models etc.

Nice job, Databricks; nice numbers too. Looking forward to more improvements.

Thought the same until I read this:

> Contact us at hello-dolly@databricks.com if you would like to get access to the trained weights.

https://github.com/databrickslabs/dolly it’s now available on GitHub

That’s the repo with the code to train the model to get the weights, not the trained weights.

Did you try emailing?

"Hi, could I have the weights? I'd like to upload them as a torrent so anyone can download them freely without having to ask so as to broaden access."

Can you guess what their reply would be?

This is not an issue though; they would just be the weights used by Databricks. There is no reason you can't add your own, right?

Like giving away a website template without the demo content, it's perfectly normal.

See above; there are simply legal uncertainties about commercial use for users, so we want to make sure anyone getting them knows this clearly. That said, you can recreate these weights for like $30.

Data transfer might actually be the problem there, not something like trying to hide the model.

bittorrent, come on

Context for `come on`: what is the point in sharing the weights?

Someone lay out the reason they should package the weights with this, when they're allowing you to apply your own ?

This repo isn't what you think it is.

‘come on’ meaning that there is an obvious 23-year-old solution to data transfer constraints called BitTorrent

This is the real risk to OpenAI's business model. If it turns out that you can get most of the same outcome with drastically smaller and cheaper models, then OpenAI is going to have a hell of a time keeping customers around as it will just be a race to the bottom on price and bigger, more expensive models will lose just from a hardware cost standpoint.

No disrespect to the author intended, but the above comment is muddled.

1. OpenAI, the organization, is not equivalent to its chat offering.

2. Saying "the" real risk isn't persuasive. Let's examine many risks before claiming one is the most significant. Also, "real" in this usage is often a throwaway (i.e., unneeded) word, in editor speak.

3. Let's talk about OpenAI's "business model" (though such discussions are tricky).

3A. Originally, OpenAI wasn't trying to "hold onto" AI advancements. It claimed to be a broadly funded way to explore fundamental questions of artificial intelligence in a non-commercial, ethical way.

3B. Of course, the above claim was largely aspirational, because it wasn't baked into their DNA in a way that could survive the surrounding temptations of more funding, glory, and resources.

3C. Even with their more commercialized model of the last several years, their business model feels like (a) fundraising in exchange for (b) (claimed) collective-good open source tools and shared research.

3D. OpenAI feels to me more and more like a commercial research lab; there does seem to be a lot of commercial partnering with their funding organizations (e.g. Microsoft).

4. I doubt the leadership there views the current ChatGPT models as unchanging. I expect there is a considerable revenue stream around the space. OpenAI is well positioned to play the game several steps ahead of others.

I would frame the broader question this way: for many years, there has been a hunger for this deeper AI research, due not only to (i) the expertise and resources required, but also (ii) to this hope that there is an organization that can maybe keep it within human or ethical bounds.

Unfortunately, this amorphous hope doesn't seem to match the actual organizational incentives or dynamics. It is also unclear how much demand the public in a free market will have for nobler research.

My position on these kinds of things is simple: follow the money. If we want an accountable, public-interest AI research laboratory, it's going to have to be designed, funded, and overseen very differently.

On the flip-side, OpenAI is primed to destroy their competitors. Partnership with Microsoft means they can buy Azure compute at-cost if need be. Their current portfolio of models is diverse on the expensive and cheap ends of the spectrum, with thousands of people on Twitter and HN still giving them lip-service. With dozens of clones hitting the market, OpenAI is the only one staying consistently relevant.

The widespread adoption of local AI won't obsolete a well-priced AI API. I feel like we learned that lesson pretty thoroughly in the SaaS era.

The difference between this and SaaS is that businesses have been moving their (end user) products to SaaS due to wider broadband availability, as well as greed (read: MRR), but on the LLM side, people are building new products with it, so the incentives are to keep your costs low (or free) so you can make more money once you release.

> The widespread adoption of local AI won't obsolete a well-priced AI API. I feel like we learned that lesson pretty thoroughly in the SaaS era.

Unless I am misunderstanding (?), this seems like an overgeneralized lesson. There are many key differences between these situations that make such a connection unlikely. Could you explain your reasoning?

That’s why they are moving so fast and trying to get as much press/media attention as possible.

They want to stay top of mind.

Think about Coca-Cola: anyone can make a drink just as good. But it's almost impossible to build their brand and distribution from scratch.

What about the high quality training data that OpenAI has encoded into ChatGPT? Do these other models come close to that?

Why couldn't you just use OpenAI's API to feed prompts and then take the outputs and use them to train your own model to exfiltrate the best features of GPT?
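A sketch of what that would look like, with `query_teacher` as a hypothetical stand-in for the API call (a real version would be subject to the provider's terms of service, as the replies below discuss):

```python
# Sketch of the distillation idea: query a teacher model for completions and
# save (prompt, response) pairs as training data for a student model.
# `query_teacher` is a stand-in; a real version would call a hosted model's
# API and would be subject to its terms of service.

def query_teacher(prompt: str) -> str:
    # Placeholder for an API call to the teacher model.
    return f"(teacher's answer to: {prompt})"

def build_distillation_set(prompts):
    seen, pairs = set(), []
    for p in prompts:
        if p in seen:  # skip duplicate prompts
            continue
        seen.add(p)
        pairs.append({"instruction": p, "output": query_teacher(p)})
    return pairs

dataset = build_distillation_set(["Summarize X.", "Summarize X.", "Explain Y."])
```

The Alpaca dataset was produced in essentially this way (self-instruct prompts against text-davinci-003), which is why its license status is debated throughout this thread.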

Give it a try if you feel like it is a good thing to do. I'm sure some nation states are doing it.

P.S. this comment does not reflect my personal values. But I would rather someone with values try it almost like a white hat pen test.

Because it would be against their TOS, and things could look ugly, legally.

Is this a bit? If it's illegal to train on copyrighted material, then OAI has broken the law ten times over by training GPT3. There's absolutely zero reason for them to sue, they'll just ban the responsible people.

It's still an open question if any of these models, trained on copyright work, will themselves be eligible for copyright protection.

How many TOS agreements do you suppose they violated while training their models?

I think their TOS forbids using the API for this. I don't think it covers the use of the web interface.


"You may not [...] except as permitted through the API, use any automated or programmatic method to extract data or output from the Services, including scraping, web harvesting, or web data extraction;"

Can’t be automated, so manual extraction is allowed.


That's how Alpaca is made

I wouldn't underestimate the power of momentum

I’d like some clarification of terms - when they say it takes 3 hours to train, they’re not saying from scratch are they? There’s already a huge amount of training to get to that point, isn’t that correct? If so, then it’s pretty audacious to claim they’ve democratized an LLM because the original training likely cost an epic amount of money. Then who knows how much guidance their training has incorporated, and it could have a strong undesirable viewpoint bias based on the original training.

The 3 hours is the instruction fine-tuning. The base foundational model is GPT-J which was already provided by Eleuther-AI and has been around for a couple of years.

Note: I work at Databricks and am familiar with this project but didn't work on it.

Do you know why GPT-J is being used instead of NeoX or any of the other larger open source models?

7B is a sweet spot where you can do something with limited resources both for training and inference. Going beyond that you spill out of an A100 without tricks. We will continue iterating on this with other models.
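A back-of-envelope check on why this size range "spills out" of a single A100 without tricks: under the common mixed-precision Adam assumption of roughly 16 bytes of state per parameter (2 B fp16 weights, 2 B fp16 gradients, 12 B fp32 optimizer state), full fine-tuning of a 6B model already exceeds one 80 GB card, while fp16 inference fits easily. The per-parameter byte counts are a rule of thumb, not from the comment:

```python
# Back-of-envelope memory for full fine-tuning with Adam in mixed precision.
# Assumed rule of thumb: 2 B fp16 weights + 2 B fp16 grads
# + 12 B fp32 optimizer state (master weights, momentum, variance)
# = 16 bytes per parameter, ignoring activations.

def train_memory_gb(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / 1e9

gptj_train = train_memory_gb(6e9)  # 96.0 GB: spills out of one 80 GB A100
gptj_infer = 6e9 * 2 / 1e9         # 12.0 GB of fp16 weights: fits easily
```

This is why tricks like sharded optimizers, 8-bit Adam, or parameter-efficient tuning come up as soon as you go past this size on a single card.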

If fine tuning a small model, which can be run on consumer hardware once trained, provides quality results, why use a larger base model?

It's immediately become difficult to untangle the licensing here. Is this safe for production use - I have no idea if I can expect a DMCA from Mark if I step out of bounds with this or other post-Alpaca models, unless I'm missing something important. Meta really botched the Llama release.

Yes it's nuanced, but will be simplified going forward.

This uses a fully open source (liberally licensed) model and we also open sourced (liberally licensed) our own training code. However, the uptraining dataset of ~50,000 samples was generated with OpenAI's text-davinci-003 model, and depending on how one interprets their terms, commercial use of the resulting model may violate the OpenAI terms of use. For that reason we are advising only noncommercial use of this model for now.

The next step here is to create a set of uptraining samples that is 100% open. Stay tuned.

Are you in touch with the OpenAssistant team? I believe they already have a more or less complete set of samples (100,000!) that were produced in an open environment and aren't encumbered by any licensing.

No, I hadn't heard of that; we'll engage with that team. This is exactly what we need; we'll look into it.

This has nothing to do with Facebook. The foundational model here is GPT-J, which is open source and safe to use. Sadly, it is inferior to state-of-the-art models such as LLaMA.

But they're "using data from Alpaca". I don't know what that means, isn't Alpaca using data generated by ChatGPT, which isn't "clean" to use? Or data from Facebook, which isn't "clean" to use? I'm drowning.

They are instruction tuning it using the dataset released by the stanford-alpaca team. The dataset itself is synthetic (created using GPT-3) and somewhat noisy, and in my view can be easily recreated if OpenAI ever tries to go after it (which is very unlikely). Anyway, Facebook has nothing to do with anything used by this project.

So, this is a "dirty" model, in that it was created with data that violated OpenAI's ToS. Obviously, this kind of violation is basically fine if you're a massive corporation the rules don't apply to, but it's a huge risk if you're a small fish.

"basically fine if you're a massive corporation who the rules don't apply to, but it's a huge risk if you're a small fish"

With these things, it is usually the other way around.

If you are a small fish, no one will care. But if you are big enough that money could be extracted from you, then they will come. A big org just has better lawyers and negotiating power, but they really cannot ignore the law. Especially not if there is a competitor with money to sue.

So if you are small and want to become big, better be cautious on the legal ground you are walking.

ToS are not the law. It would be similar to your power company claiming copyright over the code written using "their" electricity. Not going to happen. I wouldn't be too concerned.

No, but you could be banned from using OpenAI products in the future, which seems like quite a liability for a researcher or company.

That would be anticompetitive practice that is actually against the law in many countries[1]. In the unlikely event of OpenAI ever engaging in such things they will be sued into oblivion.

[1] https://en.wikipedia.org/wiki/Refusal_to_deal

No it wouldn't. Wikipedia has a crap definition that inexplicably focuses on cartels where multiple companies coordinate the refusal, which this definitely isn't. The FTC has a better definition for US law [1].

Companies routinely ban users for ToS violations. Just look at any thread about Google on here to see people complaining about it.

[1]: https://www.ftc.gov/advice-guidance/competition-guidance/gui...

The FTC link has an example of the only newspaper in town refusing to deal with customers who are also running ads on a radio station. Do you think if the newspaper dressed such refusal as a ToS violation it would fly with FTC?

Google might be banning people for enforceable violations of their ToS but imagine the uproar if they banned a Bing engineer for using Google search to find solutions for some Bing problem (which is similar to the problem here). The upside for Google or OpenAI would be somewhat limited but the downside is almost boundless.

Especially when OpenAI explicitly doesn't have a claim to copyright on the model output.

If you use output from a non-profit that open sourced output gained by following the TOS (as in, they aren't using it "for profit"), it's not illegal, because:

A. it's an output gained via following the letter of the law (TOS).

B. TOS only applies directly to people who've accepted the TOS; unless Alpaca's license/TOS ALSO forwards the same criterion as its source at OpenAI, derivatives wouldn't be bound.

It's like if an app developer on iOS violated a ToS and Apple tried to go after everybody who ever used the app: they didn't agree directly to the ToS, only the developer did.

That's between OpenAI and the people that recorded the data. No one else needs to care.

I don't know the full details, but Alpaca is from Stanford and only based on LLaMA (not a derivative work, AFAIK). That said:

Also, see Meta's licensing here: https://github.com/facebookresearch/llama/blob/main/LICENSE

Can't be sure what that license actually refers to: the language model or just the tooling in the Git repo.

I agree it's a minefield, but with Meta I would err on the side of caution.

Why? Dolly had nothing to do with Llama or its weights.

Besides: How would anyone ever know which model generated the output you are serving? AFAIK there is no fingerprint in any model’s output. And even if there was, it would probably be destroyed by fine tuning “over it”.

> AFAIK there is no fingerprint in any model’s output.

It seems like there easily could be. What if some of the data they trained it on didn't exist anywhere else except in the training set, and was put there specifically for this purpose? For instance they could have taught it a few poems that don't exist anywhere else. If you can coax the LLM of unknown origin into reciting those poems back to you, you know where it came from.

Even easier: have a small set of 8-10 character gibberish tokens it's trained on in particular contexts (e.g., a non-existent poem). Then feed it one or several poems and see if a gibberish token pops out.

I think they call these canary GUIDs. If you manage to generate one from an LLM then you can conclude with certainty that the model saw that document during training.
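A toy sketch of the canary idea, with `model_generate` as a stand-in for sampling from the suspect model (here it simulates a memorized leak):

```python
import uuid

# Sketch of the canary idea: plant a unique string in the training set, then
# probe a model of unknown provenance for it. `model_generate` is a stand-in
# for the suspect model; here it simulates having memorized the training doc.

canary = f"canary-{uuid.uuid4().hex}"
training_doc = f"An otherwise unremarkable poem.\nRefrain: {canary}\n"

def model_generate(prompt: str) -> str:
    # Stand-in: a model that memorized the training doc would complete the
    # refrain; a model that never saw it could not produce the canary.
    return training_doc if "Refrain:" in prompt else "no idea"

def saw_training_doc(generate) -> bool:
    return canary in generate("Finish the poem. Refrain:")

leaked = saw_training_doc(model_generate)
```

Because the canary is a random unique string, a single hit is strong evidence the document was in the training set; the converse does not hold, since a model can train on a document without reliably regurgitating it.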

> Besides: How would anyone ever know which model generated the output you are serving?

There's precedent for "whatever you can get away with" in tech companies, but establishing a culture of that at the start of this new big change could end up undesirable for most people.

For example, it could relieve demand for more legal and sustainable ways, until it's too late. (Look at the history of digital entertainment media piracy and DRM and legislation, for example. Or look at the history of software piracy, where some big companies seem to actually want their product to be pirated, partly because it builds a bigger moat against competitors, and they can legally strongarm some of those pirates later.)

> Meta really botched the Llama release.

It's no surprise really, though; from what I see, they recognised some way to monetize and rolled back their commitment.

But this Dolly doesn't depend on LLaMA (unless I'm missing something), so you don't have to use it.

Given that Alpaca strictly specified it was released purely for academic use, and any commercial use was prohibited since it would violate terms of service, I don't see this as viable for use. Looks like a marketing gimmick.

Fine-tuning these models reminds me of the good ol' days with tube TVs where the slightest twist of the vertical hold dial meant the difference between a clear picture and useless, dizzying, visual nonsense.

Open Assistant is doing the same thing, but actually creating a dataset that isn't on questionable legal grounds by creating a gamified web app where people can contribute: https://open-assistant.io/dashboard

I wonder how small can these models get? From 175B to 6B with comparable performance is huge, but can it go lower?

I don’t love the lack of quantitative comparison to Alpaca but a commercial model (which sounds like it’s in the works) would finally move the needle on democratizing access to LLMs.

Will also commend the authors for not falling into the “LLMs can’t perform without 200B params!” fallacy. For anyone reading, 6B params is enough to train on a 3090. A PC rig for training or running inference with this would put you back maybe $4k.

The end game here is likely getting the model to perform well in millions of parameters on specific tasks. Most business uses of ChatGPT are pretty closed domain tasks, it wouldn’t be a huge step to distill this model on a specific task and get it down to 150-350M params (which is roughly BART size and can run on AWS Lambda even).
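Task-specific distillation of the kind described above is usually done by matching the small model's output distribution to the large model's temperature-softened one. A stdlib-only sketch of the core loss (real pipelines would use a framework like PyTorch; the logits here are made-up toy values):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The student is trained to minimize this, pulling its distribution
    toward the teacher's on each training example.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits -> zero loss; diverging logits -> positive loss.
print(distillation_kl([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))
print(distillation_kl([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among wrong answers, not just the argmax, which is a big part of why distilled models stay surprisingly capable at small sizes.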

It's trained on the Alpaca dataset, which in turn was generated from OpenAI's davinci. Wondering if it is effectively transferring the weights by generating content from the source model?

Interesting to see how Dolly has improved/leveraged an off-the-shelf older model (https://huggingface.co/EleutherAI/gpt-j-6B) with dramatic results.

I see that in its five book suggestions it has suggested you should read Hitchhiker's Guide twice.

Not many humans would even get this answer correct.

I am impressed.

It's a great example of how LLMs can be fine-tuned for specific behaviors. The gain in output quality relative to the expense of tweaking an older, relatively small model is quite impressive. Dolly may not be the Bugatti of LLM chatbots. But it illustrates that high performance is increasingly within reach for the rest of us.

How hard would it be to embed this into a NPM module so anyone can use it in their servers / apps locally?


Download GPT-J-6B from Eleuther

Download Alpaca Fine Tuning Code + Alpaca Examples

Train for 6 hours or so.

Get vaguely good RLHF model

Key point is vaguely good. Scale is still important and that manifests in the difference between gpt3.5 and gpt4 based chatgpts. It's qualitatively and quantitatively so much better in pretty much every benchmark. There is no way around the bitter lesson.

Isn't it the case that we literally have no clue how GPT4 and GPT3.5 are different in terms of training, given OpenAI doesn't want to disclose anything at all?

It's not true we know nothing. We know a little bit by using the two models from their API. Given the time per inference and the limit on messages per day for GPT4, I'm willing to bet it's doing around 10x more compute than GPT3.5. If that's because it has 10x more weights, I don't know. But it wouldn't be a terrible guess.
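For a back-of-the-envelope version of that bet, the usual approximation is ~2 FLOPs per parameter per generated token for a forward pass. The parameter counts below are the guesses from this thread, not disclosed figures:

```python
def inference_flops_per_token(n_params: float) -> float:
    # Rough rule of thumb: a transformer forward pass costs
    # about 2 FLOPs per parameter per token.
    return 2 * n_params

gpt35_guess = 175e9            # GPT-3-sized guess for GPT-3.5
gpt4_guess = 10 * gpt35_guess  # the "10x more compute" guess above

ratio = inference_flops_per_token(gpt4_guess) / inference_flops_per_token(gpt35_guess)
print(ratio)  # under this model, compute scales linearly with parameter count
```

Which is why "10x slower inference" maps so directly onto "maybe 10x more weights" in this kind of guessing, absent other factors like longer context or retrieval.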

So your estimate is that GPT4 has 1.75 trillion weights?

Is there anything that affects inference compute time besides the number of parameters? Assuming same hardware, etc.

Yes - for example adding memory to the attention mechanism (similar to RETRO or Memorizing Transformers paper)

We don't have the details, it is true. But empirically, and based on their report, GPT-4 is notably better than ChatGPT.

Better, yes, and for that we have evidence. But is the improvement stemming simply from even more data? That's what I'm questioning.

This paper is pretty approachable and goes over the "scaling laws" in detail: https://arxiv.org/abs/2206.07682

In short, yes. More data, higher quality data, more epochs on the data. That is the name of the game.
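One widely cited rule of thumb on the data side comes from the Chinchilla scaling work (Hoffmann et al., 2022, a different paper from the one linked above): roughly 20 training tokens per parameter for compute-optimal training. A quick sketch of the arithmetic:

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    # Chinchilla heuristic: ~20 training tokens per parameter
    # for compute-optimal training.
    return 20 * n_params

for params in (6e9, 175e9):
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params:.0e} params -> ~{tokens:.1e} tokens")
```

By this heuristic a GPT-3-sized model would want ~3.5 trillion tokens, far more than GPT-3 was actually trained on, which is one argument that more and better data, not just more parameters, is where the gains are.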

That paper doesn't discuss GPT-4 at all. It does however contain this interesting excerpt (emphasis mine):

> Although we may observe an emergent ability to occur at a certain scale, it is possible that the ability could be later achieved at a smaller scale—in other words, model scale is not the singular factor for unlocking an emergent ability. As the science of training large language models progresses, certain abilities may be unlocked for smaller models with new architectures, higher-quality data, or improved training procedures. For example, there are 14 BIG-Bench tasks5 for which LaMDA 137B and GPT-3 175B models perform at near-random, but PaLM 62B in fact achieves above-random performance, despite having fewer model parameters and training FLOPs.

So it's not obvious that it should be so straightforward.

It's speculated it has the same number of parameters, but uses more compute and is multimodal.

Free is better than $$/token imho.

If you have a use case or a bunch of disposable income then go with the “bitter” one.

> There is no way around the bitter lesson.

Isn't there? I'm certainly not sure, based on the results published over the last weeks and months.

The giant GPT-{3.5,4} models show that if you make the model big enough and throw enough data at it you can produce an AI capable of conversing on basically any topic, in dozens of languages. There are plenty of different takes on how near-human its abilities are on specific tasks, but it's worth stepping back and appreciating how super-human the breadth of this knowledge is.

But it's also not clear if a mega-model is anything close to the most efficient way of storing knowledge. After all, you don't need to memorize every fact in Wikipedia if you know how to effectively search it.

And we're currently seeing a daily explosion in these capabilities. Today's flavor is interfacing with Wolfram, but we've also seen web searches, python coding, etc. That, I think, is the real superpower that comes out of this: you or I can answer a question by "doing a web search" or "querying a database" or "using Wolfram" or "developing a python program that finds the answer". However, an AI could do tasks like this just by "thinking" about it. Maybe it would be as natural as we find blinking.

That to me is the real breakthrough in stuff like Alpaca -- start with a mega-model and prompt it with something like: "After this paragraph, you are going to be speaking to an AI model similar to yourself but much more primitive. Its task will involve interfacing with English speakers, so converse with it only in that language. It has access to the same {X,Y,Z} APIs you have, so any time it has trouble answering a question, prefer to give hints about how it could find the answer using those APIs rather than providing the answer directly yourself. Only give an answer directly if it repeatedly fails to be able to answer by using an API. I've provided a large set of standardized tests used by humans at this URL -- start by asking it questions intended for a preschool-aged child. Each time it is able to answer new questions at a given level correctly 99% of the time, increase the material's level until it is able to achieve that score on a test designed for a Computer Science PhD candidate."

How large would the "student" model have to be to succeed at this deep but narrower task? I think the answer right now is "we have no idea". However, if the model has the advantage that it can rely on external knowledge and tools from the start (and is rewarded by the "teacher" for doing just that), I bet it'll be a lot smaller than these mega-models. Sure, you wouldn't be able to disconnect the "student-AI" from its APIs and expect it to converse with you in Hungarian about the history of yacht design, but that might not be a capability it needs to have.

My personal hunch is that we're going to find these "AI-taught specialist AI, with API access" models will be a lot smaller than most people are expecting. That's the moment when things REALLY change: instead of pairing a human with a mega-model AI, if specialized models are cheap someone can say "spin up 100K expert-programmer AIs and have them supervised by 5K expert-manager AIs and have them build XYZ".

Or if you need it to work on an existing task you'd specialize further -- you'd go to your AI vendor and say "I'd like to license the weights for your expert-programmer model, but first have it read these 200 books I consider important to my problem domain and then show it every commit ever made by a human to my git repo and every design document I have"

> you don't need to memorize every fact in Wikipedia if you know how to effectively search it.

Yeah, you're onto something. Models good enough to sustain a conversation where I bring my own data as a primer are probably more useful than models that have a frozen knowledge of everything. The killer feature of GPT-4 is the 32k token context size, which allows an unprecedented amount of input to be fed into the model and queried.
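The "bring your own data as a primer" pattern is usually simple retrieval plus prompt stuffing within the token budget. A toy sketch -- the word-overlap scoring and 4-chars-per-token estimate are crude stand-ins for real embeddings and tokenizers, and all names here are invented:

```python
def score(query: str, chunk: str) -> int:
    # Toy relevance: count overlapping words (real systems use embeddings).
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, chunks: list, token_budget: int = 32_000) -> str:
    """Rank chunks by relevance and pack as many as fit into the context."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    parts, used = [f"Answer using the context below.\nQ: {query}\n"], 0
    for chunk in ranked:
        est_tokens = len(chunk) // 4  # crude ~4 chars/token estimate
        if used + est_tokens > token_budget:
            break
        parts.append(chunk)
        used += est_tokens
    return "\n---\n".join(parts)

docs = ["Yacht design history in Hungary...", "Unrelated recipe for soup."]
prompt = build_prompt("history of yacht design", docs)
print(prompt[:120])
```

A bigger context window mostly just raises `token_budget`, letting you stuff in more (or less aggressively filtered) primer material per query.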

Very good analysis. I disagree with a fundamental point though: If you don't consider compute cost and just want the best possible AGI, then there's nothing stopping you from supercharging the mega-models with the same capabilities as the smaller models - and if the current scaling shows anything, the mega models will just become even better.

Sometimes you do need to consider compute cost, say if you want a small but high quality model that can run on a smart phone to perform a task. For example, with camera input, identify a plant or animal, while in a remote area with no cell signal, so it has to yield an answer without communicating with a server. What's the smallest, most efficient model that can do that effectively? Build that.

> If you don't consider compute cost [...]

Yes, but what if you do? Imagine your hyper-specialized API-heavy model takes 10x fewer resources to answer a question (or at least a question relevant to the task at hand). Won't it be more powerful to have a model that can run 10 times as fast (or run 10 instances in parallel)?

What if the ratio turns out to be 100x or 1000x?

So I agree that the cutting edge of "best possible AGI" might mean building the largest models we can train on massive clusters of computers and then run on high-end hardware. My hunch, though, is that models that can be run on cheap hardware and then "swarmed" on a problem space will be even more powerful in what they can perform in aggregate.

Again, it's just my hunch but right now I think everybody's predictions are hunches.

I'll actually go one bit further: even for a linear task that can't be "swarmed" in the same way, it could be that cheaper-per-token models could even do better on linear problem-solving tasks. Existing models already have the ability to use randomness to give more "creative", if less reliable, answers. This is inherently parallelizable though -- in fact Bard seems to be exposing this in its UI in the form of multiple "drafts". So what if you just ran 100 copies of your cheap-AI against a problem and then had one cheap-AI (or maybe a medium-AI) judge the results?
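That "run many cheap copies and have one judge the results" idea is essentially best-of-N sampling. A toy sketch where both the sampler and the judge are stand-in functions (a real setup would call a small model at high temperature and a judge model, respectively):

```python
import random

IDEAS = ["use a heap", "sort first", "hash everything", "binary search"]

def cheap_model(prompt: str, rng: random.Random) -> str:
    # Stand-in for a small model sampled with high randomness.
    return rng.choice(IDEAS)

def judge(prompt: str, answer: str) -> float:
    # Stand-in for a judge model scoring each candidate answer.
    return float(len(answer))

def best_of_n(prompt: str, n: int = 100, seed: int = 0) -> str:
    """Sample n candidate answers, return the one the judge scores highest."""
    rng = random.Random(seed)
    candidates = [cheap_model(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda a: judge(prompt, a))

print(best_of_n("how do I find the k smallest items?"))
```

The appeal is that generation parallelizes perfectly across the N cheap copies, while the judging pass is a single extra call, so quality can be traded for compute almost linearly.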

Or at the risk of getting too anthropomorphic about it: imagine you as a human are writing a program and you get stuck on a tricky bit -- you know the problem should be solvable, but you've never done anything similar and don't know what algorithm to start with. Suppose then you could tell your brain: "Temporarily fork off 100 copies of yourself. 10 of them go do a literature review of every CS paper you can find related to this topic. 10 of you search for open source programs that might have a similar need and try to determine how their code does it. The other 80 of you just stare off into the middle distance and try to think of a creative solution. In two human-seconds write a summary of your best idea and exit. I'll then read them all and see if I/we are closer to understanding what to do next."

For us, this type of mental process is so alien we can't even imagine what it would feel like. It might come completely naturally to an AI, though.

"ChatGPT, a proprietary instruction-following model" pun intended.

I always asked myself whether ChatGPT would be the only (paid) solution. Thank you, Databricks, for being an open source competitor here.

What a great time to be in this field. It’s advancing so quickly!

Very cool that this powerful new tech is within reach to the masses in a platform like Databricks!

Interesting. This allows each organization to build their own ChatGPT-style model at a fraction of the cost.

That's awesome, one step forward into democratizing data and LLMs, now, how do we get access to the weights?

With 1 server? Exciting times for broader LLM adoption.

I think this is cool, but it's on the range of complexity that I would expect from a personal project. When you put a whole organization behind it, I feel you could have provided something extra - better datasets? Improved weights from a ton of training?

Looking forward to trying it out :)

Pretty impressive stuff !!

How did they do this with $30 only?

That is incredible to me. I can't wait to see the rest of this project's progress.

here come the "Me Too!!" announcements from everyone trying to catch some of the energy of this new market

how long until IBM, Tesla and Oracle announce Me-Too LLMs?
