This is awesome! It's been years since we've seen NLP projects we can actually run on the normal computers we have at home (no, a 24GB GPU still isn't normal). I have some questions:
How long did training take?
What are the specs of your system?
How big is the training dataset in total?
It's probably over 50GB, so it's very unlikely I'll be able to overfit the model. The best part is that this means the model is always learning new things and showing new abilities.
I noticed that right after giving it ONLY a coding dataset it got a bit better at logical puzzles, so I think there's a side effect to training an LLM on sequential information like code.
The LLM seems more inclined to learn structure, for example:
Structure of language
Structure of poems
Structure of music lyrics, etc.
RTX 3080 Ti (12GB VRAM), 64GB DDR4
AMD Ryzen 7
8TB of drive space (I built it with the idea that I'd be hoarding a lot of data for potential AI research)
---
The model has probably had ~2 weeks of straight training to get to this level (it learns VERY fast, I suspect thanks to GQA + ALiBi plus the type of training I've given it; rough sketch of those two pieces below).
That's why I think there's considerably more training it can do.
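For anyone curious what GQA + ALiBi actually look like in practice, here's a minimal, self-contained PyTorch sketch of the attention step, written from the published descriptions rather than this project's code; the head counts, dims, and slope formula are just illustrative.

```python
import math
import torch
import torch.nn.functional as F

def alibi_slopes(n_heads):
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... from the ALiBi paper
    # (assumes n_heads is a power of two, for simplicity).
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def gqa_alibi_attention(q, k, v, n_kv_groups):
    # q: (batch, n_q_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim), n_q_heads = n_kv_heads * n_kv_groups
    b, n_q_heads, seq, d = q.shape

    # GQA: each group of query heads shares one K/V head, so K and V are
    # simply repeated along the head dimension (smaller KV cache, fewer params).
    k = k.repeat_interleave(n_kv_groups, dim=1)
    v = v.repeat_interleave(n_kv_groups, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, heads, seq, seq)

    # ALiBi: instead of positional embeddings, add a per-head linear penalty
    # that grows with the distance between query and key.
    pos = torch.arange(seq, device=q.device)
    rel = (pos[None, :] - pos[:, None]).float()          # rel[i, j] = j - i
    bias = alibi_slopes(n_q_heads).to(q.device)[:, None, None] * rel
    scores = scores + bias.to(scores.dtype)

    # Causal mask: queries can't attend to future tokens.
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(future, float("-inf"))

    return F.softmax(scores, dim=-1) @ v

# e.g. 8 query heads sharing 2 K/V heads (4 query heads per group)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_alibi_attention(q, k, v, n_kv_groups=4)  # -> (1, 8, 16, 64)
```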
You can also use Project Gutenberg if you want a huge, legal dataset.
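If it helps anyone get started, here's a rough sketch of pulling plain-text books from Gutenberg into a single training file. The URL pattern, ebook IDs, and header/footer markers are assumptions based on how gutenberg.org serves its plain-text files, not anything from this project, and for bulk downloads you should use their recommended mirrors rather than hammering the main site.

```python
import re
import urllib.request

def fetch_gutenberg_text(ebook_id):
    # Assumed URL pattern for Gutenberg's cached plain-text files.
    url = f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"
    with urllib.request.urlopen(url) as resp:
        raw = resp.read().decode("utf-8", errors="ignore")
    # Strip the Gutenberg license header/footer between the
    # "*** START OF ... ***" and "*** END OF ... ***" marker lines.
    start = re.search(r"\*\*\* START OF.*\*\*\*", raw)
    end = re.search(r"\*\*\* END OF.*\*\*\*", raw)
    if start and end:
        raw = raw[start.end():end.start()]
    return raw.strip()

if __name__ == "__main__":
    # e.g. dump a few well-known books into one corpus file
    with open("gutenberg_corpus.txt", "w", encoding="utf-8") as f:
        for book_id in (1342, 84, 2701):  # Pride and Prejudice, Frankenstein, Moby-Dick
            f.write(fetch_gutenberg_text(book_id) + "\n\n")
```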
I've been collecting papers on small models, training with small data, using small ones to jump-start big ones, alternatives to backpropagation, etc. I wanted to eventually do a sub-500M model with those techniques that could be reused in other projects. I may or may not get around to it.
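To make the "small jump-starts big" idea concrete, here's one hypothetical way it's often framed: an already-trained small model supplies soft targets that a larger model matches alongside the usual next-token loss. The mixing weight, temperature, and the assumption that both models share a vocabulary are placeholders, not taken from any specific paper on that list.

```python
import torch
import torch.nn.functional as F

def jumpstart_loss(big_logits, small_logits, targets, alpha=0.5, temperature=2.0):
    # big_logits, small_logits: (batch, seq, vocab); targets: (batch, seq).
    # Assumes the two models share a tokenizer/vocabulary.
    # Standard next-token cross-entropy on the real data...
    ce = F.cross_entropy(big_logits.view(-1, big_logits.size(-1)), targets.view(-1))
    # ...plus a KL term pulling the big model toward the small model's
    # temperature-softened distribution.
    kl = F.kl_div(
        F.log_softmax(big_logits / temperature, dim=-1),
        F.softmax(small_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1 - alpha) * ce + alpha * kl
```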
Email me and I’ll send you some of those links for your research.