How to train your own large language models (replit.com)
283 points by benwen on April 20, 2023 | hide | past | favorite | 60 comments



Ghostwriter is notably worse than GPT-4, so while it may be true in a sense that "Training a custom model allows us to tailor it to our specific needs and requirements", the reality is they'd be getting better results just using OpenAI right now. Probably true for almost every other use case.

That said, I am patiently waiting and champing at the bit for the day this isn't true anymore. Cool to see the groundwork being laid for it.


Stable Diffusion 1.5 is not SOTA, but in reality the sea of augmentations makes SD kinda unbeatable, if you are willing to put in the work to use them.

I think LLMs could end up the same way, if the community consolidates around a good one.


Not everyone wants to depend on and trust a cloud service, and not everyone needs GPT-4 quality.

If there's a viable way to tune and run models locally they could still be useful if you don't need it to play chess and imitate a Python interpreter at the same time.


Is it possible to add to an LLM without re-training it? My understanding was no.


The original “pre-training” is what’s expensive. The “fine-tuning” (also training, in that it modifies network weights) for instruction following or other tasks costs in the thousand-dollar range.


If one of your specific needs and requirements is that you do not share data with OpenAI then this is a viable option.


Just my 5 cents: it should be easier to train a small custom model that works off a big pre-trained one, taking the latent state as input while the big model does all the hard work. But getting the latent state means it has to be accessible; that's why open source models are so valuable, even if they are not that good in general. Moreover, open source models can be used in other projects in various setups.
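
A rough sketch of what that could look like, using a Hugging Face model as a stand-in for the big pre-trained one (the model name, pooling, head size, and two-class task are all placeholder assumptions, not anything from the article): freeze the base model, expose its hidden states as the latent input, and train only a small head on top.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    base_name = "gpt2"  # placeholder for whatever open model exposes its latents
    tokenizer = AutoTokenizer.from_pretrained(base_name)
    tokenizer.pad_token = tokenizer.eos_token
    base = AutoModel.from_pretrained(base_name)
    for p in base.parameters():
        p.requires_grad = False  # the big model does all the hard work, untouched

    # The small custom model: a classifier over a pooled hidden state.
    head = nn.Sequential(nn.Linear(base.config.hidden_size, 128),
                         nn.ReLU(),
                         nn.Linear(128, 2))  # e.g. a 2-way task
    optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

    def step(texts, labels):
        batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():  # latents come from the frozen model
            out = base(**batch)
        mask = batch["attention_mask"].unsqueeze(-1)
        latent = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # masked mean pool
        loss = nn.functional.cross_entropy(head(latent), labels)
        loss.backward()  # only the head gets gradients
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

The frozen base never needs gradients or optimizer state, so the trainable part stays tiny; something like step(["some text", "other text"], torch.tensor([0, 1])) runs one cheap update.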


They’re competing directly with Microsoft (and getting crushed) because GitHub is their biggest competitor, so it makes sense that they wouldn’t want to use OpenAI products.

Agree that Ghostwriter is subpar though.


That could all change in a few months. We saw locally runnable, open source image generation catch up quick.


This is interesting. I think the future of AI will not be re-creating something like ChatGPT, but using variations of these methodologies to train AI models for specific tasks.

There are some advantages to not having to make an LLM that impresses every human being on the planet. Imagine training the AI to be good at only one specific thing. I think it will become much more precise and deterministic.

This is just my hypothesis. I'm excited to see where this goes.


There is plenty of less attention-grabbing work being done on "domain specific LLMs" like BioMedLM[0], Med-PaLM[1], BloombergGPT[2], etc.

That reminds me - I saw a somewhat-clever acronym variant for LLM that communicated this the other day but it escapes me ATM...

[0] - https://www.mosaicml.com/blog/introducing-pubmed-gpt

[1] - https://cloud.google.com/blog/topics/healthcare-life-science...

[2] - https://dev.to/reaminated/thoughts-on-bloomberggpt-and-domai...


Yeah that's awesome. I honestly think the next 'leap' in AI will come from these 'domain specific' models.

Also, I'm not talking about just a 'prompt' output model. Those are great and I'm sure they will be extremely impressive. However, I'm talking more about being able to 'operate' something.

Imagine this, an AI able to operate some specific API in a deterministic / reliable way. I'm talking about complex operations.

So the output is not so much a text prompt as an SOP, and then actually operating that SOP.

Imagine going into an app and saying "can you boot up a cluster on AWS, run a WordPress site, and point the domain example.com to the site".

Imagine this "you know my database for app X, what was the latest snapshot", it replies with the date / time of the snapshot, and you reply with "can you move that snapshot from google cloud and create a new database from that snapshot on AWS cloud?", and it does it for you.

That's what I look forward to.


So training an LLM on OpenAPI specs ;)?

It actually seems like more of a task for good ol' fashioned NLP (intent recognition) with some wiring for all of the connectors...
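
Something like the toy routing below: recognize the intent, pull out the slots, and dispatch to a hand-written connector. The intents, regexes, and connectors are all made up for illustration; a real system would use a trained intent/slot model rather than regexes, but the wiring looks the same.

    import re

    # Hypothetical connectors wrapping real cloud APIs; here they just print.
    CONNECTORS = {
        "restore_snapshot": lambda s: print(f"restore {s['snapshot']} into a new database on {s['cloud']}"),
        "deploy_site": lambda s: print(f"deploy {s['app']} and point {s['domain']} at it"),
    }

    # Toy patterns standing in for a trained intent/slot recognizer.
    INTENT_PATTERNS = {
        "restore_snapshot": re.compile(r"move .*snapshot (?P<snapshot>\S+) .* to (?P<cloud>\S+)", re.I),
        "deploy_site": re.compile(r"run an? (?P<app>\S+) site.*point (?:the )?domain (?P<domain>\S+)", re.I),
    }

    def handle(utterance: str) -> None:
        for intent, pattern in INTENT_PATTERNS.items():
            match = pattern.search(utterance)
            if match:
                # Once the intent is recognized, execution is deterministic code.
                CONNECTORS[intent](match.groupdict())
                return
        print("unknown intent; ask for clarification instead of guessing")

    handle("can you move that snapshot snap-0423 from google cloud to AWS")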


https://news.ycombinator.com/item?id=35634120 (LlamaAcademy: Teach GPTs to understand API documentation with LoRA) ?

https://github.com/danielgross/LlamaAcademy


> Imagine this, an AI able to operate some specific API in a deterministic / reliable way.

It doesn’t seem like LLMs are going to be able to do this, unless the application has a high tolerance for mistakes.


Maybe something new needs to be invented? Or maybe we can utilize something we already have? Anyway, I think this is the next leap in AI.

Operative Language Models, where the model is trained to do specific tasks very well, in a reliable way. I also don't anticipate them being 'large' and expensive.

Then we have lots of these models and some kind of 'orchestration' layer that makes them all work together. This I believe will be the future. Micro Operative Language Models.


The next leap will be decided by what someone is able to actually implement, not by what anyone thinks it should be.


Yes! This is where I'm at!


LangChain


I'm wondering about just that: I want a minimal model with little overhead that I can train on a specific body of texts, without it needing to know about all the rest. So basically it should be good at conversation and able to learn what I teach it, nothing else.

But I'm having trouble finding resources about how to achieve that.


How expensive is it? My understanding is that it's not reasonable to train an LLM from scratch by yourself, and that if you want one that isn't just very stupid then you need to spend between hundreds of thousands and hundreds of millions of dollars. But if you don't want to train from scratch then you can fine-tune existing models for cheaper.


Disclaimer: I work for MosaicML (MosaicML is the creator of the training platform used by Replit).

Training these models from scratch on your domain specific data is not as expensive as one might think. We have provided some cost estimates in our blogs.

https://www.mosaicml.com/blog/mosaicbert

https://www.mosaicml.com/blog/training-stable-diffusion-from...

https://www.mosaicml.com/blog/gpt-3-quality-for-500k


Do you have any examples of how to train a model that can write code, but in a specific domain? E.g. I only want to train it on a specific set of code, say functional React components in TypeScript.


We recently released a 1B parameter model trained on a mix of data.[1] If you've got your domain-specific data, our platform can cover the rest.

[1]: https://twitter.com/jefrankle/status/1649060478910357504?s=4...


But do you have any examples of how to do this? I am a pretty seasoned dev, but never trained a model before :)


Thank you this is very interesting!


Looking at what they're doing here, probably not as much as you think.

As you note, with the plethora of open/open-ish LLMs today and LoRA + PEFT, you can fine-tune pretty quickly and with low VRAM, so even a single A100 or whatever cloud GPU is just fine. I've even seen people pull it off in reasonable time on super cheap T4s, A10s, etc.
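
For anyone curious what that looks like concretely, here's a minimal LoRA fine-tuning sketch with Hugging Face PEFT. The base model, dataset, target modules, and hyperparameters are arbitrary placeholders (not what Replit or MosaicML use); the same shape just gets pointed at a bigger model and your own data on an A100.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                              TrainingArguments, DataCollatorForLanguageModeling)

    base = "EleutherAI/gpt-neo-125m"  # small placeholder so this runs on a modest GPU
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Wrap the frozen base with low-rank adapters; only these get trained.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights

    # Any text corpus works; wikitext-2 is purely a stand-in for your own data.
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
    data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=256),
                    batched=True, remove_columns=data.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments("lora-out", per_device_train_batch_size=4,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

Because only the adapter weights carry gradients and optimizer state, the VRAM footprint stays small, which is what makes the cheap-GPU runs feasible.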

I doubt anyone reading a blog post is attempting to train a "true" multi-billion param LLM from scratch.


[flagged]


Ouch.

Ok, I'll try again because I read the article and watched the demo video from Mosaic:

- MosaicML claims to have some magic that efficiently autoscales with awareness of specific instance per hour pricing.

- MosaicML claims to have auto-optimize magic that makes training 3-7x faster.

- The incentives seem to be aligned because their value prop is likely "we'll markup 10% but save you 50%" (or whatever).

It's too bad Replit didn't provide the costs for this post in the same way Stanford, etc have done with Alpaca, Vicuna, etc. They probably didn't for from-scratch training because the answer to your question is almost certainly "it depends" or "if you have to ask you can't afford it" (for now).


Thanks I understand better now. Sorry my other comment was rude. Probably I'm just frustrated that I will never be able to make a sentient bot because it will cost literally a billion dollars, and I just have to sit on sidelines and watch the billionaires play with their sentient bots.


Shameless plug: I built a toy tool to get into LLM fine-tuning really quickly by just pasting your training samples into a textbox:

https://github.com/lxe/simple-llm-finetuner


This is brilliant! Thank you for creating and sharing this.


I am leading a similar initiative and I have also used Databricks for the preprocessing.

Most interesting is what happens between the preprocessing and the model training - the hand-off to the cluster workers.

I guess the efficient option is to partition the data, set up shards in advance, and ideally cache or even copy the data to local workers during init.

This, of course, breaks some of the promise of being able to scale training flexibly, for instance to experiment with the scaling of compute and data.

A different way to go about it is to use a streaming/iterable dataset/loader implementation with its own sharding logic that reads from a central store of parquets with some reasonable row-group size. This gives full flexibility in terms of node/gpu/worker/batch_size for experimentation - e.g. literally as parameters in PyTorch. Of course, one has to also implement caching of remote data since the data is kept centrally.

In my opinion, there is no satisfying/flexible solution for this, especially when one also wants to experiment with complex transformations or augmentations in the dataset/loader and remain portable across cloud offerings. So, this has to be implemented from scratch (not too difficult, but still a lot of code). The coming datapipes will probably also make this trivial.
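
For concreteness, a minimal sketch of that kind of from-scratch streaming loader: an IterableDataset that shards a central list of parquet files first across ranks and then across dataloader workers. The file handling, interleaving scheme, and absence of caching/shuffling are simplifying assumptions.

    import pyarrow.parquet as pq
    import torch.distributed as dist
    from torch.utils.data import DataLoader, IterableDataset, get_worker_info

    class ShardedParquetDataset(IterableDataset):
        def __init__(self, files):
            self.files = sorted(files)  # central store of parquets

        def __iter__(self):
            rank = dist.get_rank() if dist.is_initialized() else 0
            world = dist.get_world_size() if dist.is_initialized() else 1
            info = get_worker_info()
            worker, n_workers = (info.id, info.num_workers) if info else (0, 1)
            # Interleave files across ranks, then across this rank's loader workers.
            for path in self.files[rank::world][worker::n_workers]:
                table = pq.read_table(path)  # or iterate row groups to bound memory
                for row in table.to_pylist():
                    yield row

    # node/gpu/worker/batch_size all stay plain parameters for experimentation:
    # loader = DataLoader(ShardedParquetDataset(files), batch_size=32, num_workers=4)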

Would love to hear more experiences in how you set this up!

Edit: I guess for NLP this is a good implementation and what Mosaic uses https://huggingface.co/docs/datasets/stream


> "a student coding on their phone in India should have access to the same AI as a professional developer in Silicon Valley. To make this possible, we train custom models with reduced cost."

In principle, that's great. But the reality is: whoever has the resources and benefits from something better will look for ways to get it. What they're communicating here is: the most resourceful developers on the planet aren't our ideal customer.


Is this just a high level description of the process? There isn’t really any actual code to run or tutorial to follow.


Yes, they mention they plan to publish articles with specific details later on as well as to open source some models.


Did the blog mention any metrics, like what the model size is, etc., since that seems to be one of the motivating factors?


Did we ever get any resolution about what happened after this company threatened to sue their intern for making a side project that supposedly stole all their great ideas? I would like to know before I ever consider anything from them again.


The founder admitted his mistake and the ex-intern's site is back up and running https://riju.codes/. I'm personally a fan of both Amjad's (CEO) and Radon's (intern) and realize that everyone makes mistakes. It's not a reason to discount the hard work of the people at replit.


A badly behaved CEO is absolutely a reason to avoid using a whole company.

Reading through the entire story leaves me with a bad taste in my mouth, especially this bit: "still refused to list any specific part of Replit he thought I had copied, even when I asked him for such details multiple times during the phone call, despite his continuing to claim both privately and publicly that I copied Replit unethically".

I haven't used Replit, but reading about it and looking at riju.codes, I have a hard time believing that there was any secret sauce that was inappropriately used, and the sketchy refusal to give details makes me think it's more about a CEO establishing dominance over the little people than any serious IP concern.


It ends with another threat: it's OK that you copied, but don't copy more things (what things?)


The CEO refused to apologize, and instead doubled down, taking advantage of a massive power differential between himself and a random college grad. He only apologized when the differential evaporated after the post hit the top of HN with something like 3000 points. I don't know about you, but I don't find that to be particularly acceptable, nor a "mistake", and I'm happy to continue to punish a CEO's unethical behavior.


Sounds pretty toxic


Why is he toxic? Reminds me of those media influencers calling for others to lose their jobs, freedoms, and lives because they wouldn't take the vaccine.

I'll never forgive them for their toxicity and I don't think that makes me toxic.


That’s a very generous interpretation of what happened because it wasn’t a “mistake” when he threatened the intern, it was something he purposefully and intentionally did, and doubled down on, even after having significant time to reconsider. Only when there was widespread public criticism of his actions did he backpedal.

I’m curious what he’s said or done to make you a fan?


Seems like you're just arguing about the definition of the word "mistake". Intent has nothing to do with it. From Google (Oxford Dictionary):

> an action or judgment that is misguided or wrong.

So to admit that you made a mistake just means you were "misguided or wrong", which he definitely made clear he was. You're claiming that there was significant time to reconsider, but the reality is that this all went down in a matter of hours from when the intern published his article. Sometimes people have bad judgement, and the public, especially one's peers, can help them see the error of their ways and improve for the better. If he were a repeat offender and this had happened several times, then I could totally understand it, but there's no reason to keep bringing this up every single time anything related to Replit is mentioned on HN.


No, the whole thing didn't go down over a few hours. It was weeks. Not sure why you'd lie about this to protect someone you don't know.

Furthermore, not only did he refuse to acknowledge his lies, he continued to lie, doubled down on the previous ones, and to this day still continues to make deceptive statements.

So yes, as his slimy behavior has continued it is relevant to bring this up every single time replit is mentioned here.


Mistakes are not necessarily accidental or guilt-free. The CEO absolutely made a mistake.


Thanks for linking this. This is actually a superior offering to Replit. They recently removed the ability to access a simple repl without logging in. Now you a) have to log in and b) have to deal with this obtuse IDE-in-a-browser project creation shit. It's so many extra steps before I can run code.

I just want a URL in which I can run some code. https://riju.codes/ is literally that. Thanks!


I ran into the same thing and finally made a Replit account. I'm just gonna use Riju from now on though. Using Replit with an account is way more janky than it was without needing to login.


> Using Replit with an account is way more janky than it was without needing to login.

It’s such a massive miss by their product teams. I don’t need this half-baked IDE. I want an interface that lets me run code as quickly as possible without any intermediate steps.


> I’m personally a fan of both Amjad's (CEO)

Why?


Not at all familiar with the details of this, but just to generally observe: the bigger the mistake, the harder the walk-back and the higher the chance a lesson was learned. No guarantees, but it's rare to get past the first stage.


If you read the blog post from Radon, Amjad's apology comes across as "sorry I got caught" rather than any actual remorse.


Story for those who didn't see it: https://intuitiveexplanations.com/tech/replit/


That's weird. I would never do anything even remotely similar to what my (ex) employer does. CEO sounds like a douchebag tho.


I've seen terms/clauses here in AU for full time employment, depending on the industry/niche, where you can't jump to the same industry within X months.


That's what happens when you have a society worried about money and not interested in true human development.


This CEO also has a history of punching down on Twitter. A very bad look.


Shh, another post on the home page is about them hiring (YC W18), don’t interfere with the business model!


That's just a coincidence. The job posts go into a queue and it's semi-random when they get placed. The current submission appearing at the same time is unrelated. At least I presume it is.



