What does it cost to train a 6.7B transformer from scratch? I'm not counting any data preparation, since that would be highly variable. Is this realistically possible for mere mortals? How long until it becomes a national pastime?
no, you should never pre-train your own LLM unless you have $100k+ to spare. You should only fine-tune. There is no reason you can't just fine-tune with whatever data you have.
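For a rough sense of where that number comes from, here's a back-of-envelope sketch using the standard 6*N*D FLOPs approximation and a Chinchilla-style ~20 tokens per parameter. The GPU throughput, utilization, and hourly rate are assumptions, and real bills run higher once you add restarts, experiments, and infrastructure.

    # Back-of-envelope compute cost for pre-training a 6.7B-parameter model.
    # Assumed, not quoted: Chinchilla-style ~20 tokens/param, A100 BF16 peak
    # ~312 TFLOP/s, ~40% utilization, ~$2 per GPU-hour.
    params = 6.7e9
    tokens = 20 * params                  # ~134B tokens (Chinchilla heuristic)
    train_flops = 6 * params * tokens     # standard 6*N*D approximation

    peak_flops = 312e12                   # A100 BF16 peak, assumed
    utilization = 0.40                    # model FLOPs utilization, assumed
    gpu_hours = train_flops / (peak_flops * utilization) / 3600

    cost_per_gpu_hour = 2.0               # USD, assumed
    print(f"GPU-hours: {gpu_hours:,.0f}")                          # ~12,000
    print(f"Compute cost: ${gpu_hours * cost_per_gpu_hour:,.0f}")  # ~$24,000

That is compute only; by the time you count failed runs, tuning, and the people operating the cluster, the $100k+ figure is not far off.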
I have a huge company-internal dataset with domain-specific knowledge. Are you saying that I can just fine-tune an existing model with that data and be fine?
That was exactly our initial idea, but from everything I learnt while trying it, this is a dead-end approach. From my understanding, the consensus seems to be that fine-tuning works well to alter or restrict behaviour but very badly to teach additional knowledge. You can fine-tune a generic base model into a generic chatbot, but not into a domain expert.
That also seems to be the reason people still use vector databases for large domain-knowledge datasets. I'm aware that the vector-database approach has its own pros and cons, but if fine-tuning on the whole content were possible, we would certainly use it in addition.
I'm not an expert, so I'd appreciate any comments, hints, pointers and corrections if I'm mistaken in my understanding.
And my original question still stands: $100k is not a lot for a company, so surely it must be more than that?
Pre-training and fine-tuning use the exact same method of next-token prediction. The difference is in the quantity of data you have (and whether the model starts from pre-trained weights).
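A minimal sketch of what that looks like in code, assuming PyTorch and Hugging Face transformers (gpt2 is just a stand-in): the fine-tuning step below computes the same shifted next-token cross-entropy loss that pre-training uses; only the starting weights and the data change.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Same objective either way: predict the next token.
    # "Pre-training" would start from a randomly initialized model;
    # "fine-tuning" starts from already-trained weights, as below.
    tok = AutoTokenizer.from_pretrained("gpt2")            # example model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    batch = tok(["Your domain text goes here."], return_tensors="pt")
    # Passing labels=input_ids makes the library compute the shifted
    # next-token cross-entropy loss, identical to the pre-training loss.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()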
You need to train the model on 1 trillion tokens (see https://platform.openai.com/tokenizer and https://github.com/google/sentencepiece) anyway for it to get reasoning capabilities, and it seems very unlikely that your data is that large.
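If you want to check how far your corpus is from that scale, just count tokens. A minimal sketch assuming OpenAI's tiktoken library (the tokenizer page linked above uses the same encodings); sentencepiece works the same way with your own tokenizer model. The file names are hypothetical.

    import tiktoken

    # Count tokens in a corpus to compare against the ~1T-token ballpark.
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding

    def count_tokens(paths):
        total = 0
        for path in paths:
            with open(path, encoding="utf-8") as f:
                total += len(enc.encode(f.read()))
        return total

    # Example (hypothetical file list):
    # print(count_tokens(["docs/internal_wiki.txt", "docs/manuals.txt"]))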
I'm highly skeptical that you have enough data to pretrain if you don't have enough data to fine tune.
Fine-tuning + vector search + prompting with as much stuff as you can fit, on an LLM like PaLM 2 or GPT-4, is what I would do. Otherwise you can use Falcon 40B ofc.
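For the vector-search part, a rough sketch of the retrieve-then-prompt pattern, assuming sentence-transformers for embeddings and plain cosine similarity; a real setup would swap in a vector database and whichever hosted LLM API you're using. The model name and documents are placeholders.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Embed the domain documents once, retrieve the best matches per query,
    # and stuff them into the prompt of whatever LLM you call.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model

    docs = ["Internal process doc A ...", "Product spec B ...", "Runbook C ..."]
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def build_prompt(question, k=2):
        q_vec = embedder.encode([question], normalize_embeddings=True)[0]
        scores = doc_vecs @ q_vec                # cosine similarity (normalized)
        top = np.argsort(scores)[::-1][:k]
        context = "\n\n".join(docs[i] for i in top)
        return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # The returned prompt then goes to the chat/completions API of your choice.
    print(build_prompt("How do we deploy service X?"))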
The data is not the problem; I could train on any of the public datasets combined with my own data. And here is my point: the result I'd get by training from scratch on that combined dataset cannot be achieved more cheaply by taking a model already pre-trained on the huge generic dataset and giving it additional training on the large domain-specific dataset.
From what I understand:
If you fine-tune only on the domain-specific data, then by the time the model has picked up that knowledge, it will already have forgotten most of its generic knowledge.
If you train on the combined dataset, it will take as many epochs for the domain knowledge to shine through as it took to train the original model, so it would cost the same as training from scratch. (A sketch of what that combined-dataset mixing looks like follows below.)
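For what it's worth, "training on the combined dataset" in practice means mixing the two sources at some ratio rather than simply concatenating them; a rough sketch assuming the Hugging Face datasets library (the dataset names and the 90/10 ratio are placeholders, not recommendations):

    from datasets import load_dataset, interleave_datasets

    # Mix generic pre-training text with the internal domain corpus.
    generic = load_dataset("c4", "en", split="train", streaming=True)
    domain = load_dataset("json", data_files="internal_corpus.jsonl",
                          split="train", streaming=True)

    mixed = interleave_datasets([generic, domain], probabilities=[0.9, 0.1],
                                seed=42)
    # `mixed` is then tokenized and fed to the same next-token training loop
    # as above -- which is why the cost ends up comparable to pre-training
    # when the domain knowledge has to compete with 90% generic data.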
You need enough tokens, but you also need to train your model the right amount on them. Too much or too little training is bad, and the base models we have are trained for roughly that sweet spot; they will tolerate a little bit of fine-tuning to adapt their behaviour, but not nearly enough to teach them new facts.
I'm not an expert, so I could be wrong. What makes me a bit confident is that I have not yet found a single project that reports having successfully used the approach you suggest.