> pushshift.io, a website and database which logs all of the posts that go on Reddit as they get posted

Such a great resource. It's surprisingly easy to build your own massive datasets using it. I re-derived WebText2, used for training GPT-3, just on a home machine. And with some image scraping you can build up image datasets for training interesting GAN models.
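For anyone curious, the collection step is roughly this. A minimal sketch against the public Pushshift API as it worked (endpoint and field names are Pushshift's; the pagination loop, karma threshold, and rate limiting are my assumptions, loosely following the WebText recipe of keeping outbound links from posts with 3+ karma):

    import time
    import requests

    API = "https://api.pushshift.io/reddit/search/submission/"

    def collect_urls(before=None, min_score=3, pages=10):
        """Walk backwards through Reddit submissions, keeping outbound links."""
        urls = []
        for _ in range(pages):
            params = {"size": 100, "sort": "desc", "sort_type": "created_utc"}
            if before:
                params["before"] = before
            batch = requests.get(API, params=params).json()["data"]
            if not batch:
                break
            for post in batch:
                # WebText-style filter: outbound links from posts with 3+ karma.
                if post.get("score", 0) >= min_score and not post.get("is_self", False):
                    urls.append(post["url"])
            before = batch[-1]["created_utc"]  # paginate backwards in time
            time.sleep(1)  # be polite to a free public API
        return urls

From there it's just fetching each URL and extracting the text.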

> the training process they used is not.

Seems like it'd be fairly straightforward to finetune an existing language model: GPT-3 if you've got spare change, GPT-J-6B can be finetuned in Colab for free, and GPT-NeoX-20B could be finetuned for free/cheap. Use simple concats of AITA posts followed by a top comment. Balance for NTA/YTA like the Training Data page mentions, and I'll bet you'll get comparable results.
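The data prep is the easy part. A sketch of the concat format, assuming the scraped posts sit in JSON with title/selftext/top_comment/verdict fields (those names are placeholders, not the bot's actual schema):

    import json
    import random

    EOS = "<|endoftext|>"  # GPT-style end-of-text separator

    def build_examples(posts):
        by_verdict = {"NTA": [], "YTA": []}
        for p in posts:
            # Concatenate post and one top comment into a single training example.
            text = f"{p['title']}\n\n{p['selftext']}\n\n{p['top_comment']}{EOS}"
            if p["verdict"] in by_verdict:
                by_verdict[p["verdict"]].append(text)
        # Balance NTA/YTA by downsampling the larger class.
        n = min(len(by_verdict["NTA"]), len(by_verdict["YTA"]))
        examples = random.sample(by_verdict["NTA"], n) + random.sample(by_verdict["YTA"], n)
        random.shuffle(examples)
        return examples

    with open("aita_posts.json") as f:
        examples = build_examples(json.load(f))
    with open("train.txt", "w") as f:
        f.write("\n".join(examples))

Downsampling is the bluntest way to balance; weighting the loss would waste less data, but for a first pass this is fine.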

That said, the _idea_ of this bot is really cool and fun.



Straightforward to tune, but given the dataset size it would require a substantial amount of compute, more than a Colab can provide without timing out.
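The usual workaround (my assumption, not something the creators describe) is to checkpoint to Drive and resume on every reconnect, e.g. with the Hugging Face Trainer. The model name is per the comment above; actually fitting GPT-J-6B on a free Colab GPU also needs memory tricks (8-bit loading, LoRA) not shown here:

    import os
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)
    from transformers.trainer_utils import get_last_checkpoint

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-style models ship without one
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

    # train.txt as built in the data-prep sketch above.
    dataset = load_dataset("text", data_files="train.txt")["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True, remove_columns=["text"],
    )

    args = TrainingArguments(
        output_dir="/content/drive/MyDrive/aita-ckpts",  # Drive survives session resets
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # trade wall-clock time for memory
        gradient_checkpointing=True,
        save_steps=500,                  # checkpoint often enough to outlive a timeout
        save_total_limit=2,              # don't fill up Drive
    )

    trainer = Trainer(
        model=model, args=args, train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    # Pick up the newest checkpoint if one exists; None means start fresh.
    last = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
    trainer.train(resume_from_checkpoint=last)

Even so, at that dataset size a single Colab GPU is going to take many sessions to get through an epoch.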

The comments by the creators imply they used some sort of SaaS for both training and deployment.



