
I have high hopes for dataset engineering. Scaling the model is already reaching the limits of hardware and funding, and the training corpus is already almost all the text written on the internet (by some accounts only about 10% of it, so maybe we have a couple more years left there).

So then what can we do? We turn to the quality of the data. Input examples can be enriched, and facts can be checked for support. Based on the existing corpus of text, we can generate new training examples. Some of them, like code, can also contain execution data and tests.
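As a rough sketch of what that could look like (the helper name and record format here are hypothetical, not any particular pipeline's API), a code example could be enriched by actually running it against a test and storing the execution outcome alongside the source:

    import json
    import subprocess
    import sys
    import tempfile

    def enrich_code_example(source: str, test: str) -> dict:
        # Write the snippet and its test to a temp file, execute it,
        # and package the outcome into one synthetic training record.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(source + "\n" + test + "\n")
            path = f.name
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=10)
        return {
            "source": source,
            "test": test,
            "stdout": proc.stdout,
            "passed": proc.returncode == 0,
        }

    record = enrich_code_example(
        "def add(a, b):\n    return a + b",
        "assert add(2, 3) == 5",
    )
    print(json.dumps(record, indent=2))

A model trained on records like these sees not just the code but whether it actually ran and passed, which is a signal you can't get from plain scraped text.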




Information is not just the public internet. Every text message, school paper, phone call, and recording of an individual could be future input. Then expand that to photos and videos. Then consume all of YouTube.

Each of these sources is still rapidly growing. We are far from drying up the well of the human experience.


We are a long way from capturing the richness of human experience. Your assumption is that everybody records everything all the time, but they don't. LinkedIn, for example, never has anybody failing at anything.

It's going to end up as a simulation of what people think other people want to hear, rather than what actually happened. The experiment has already failed, because the assumption that the internet has captured everything is wrong.


Asserting that all our audio won't be captured is yet another assumption. The amount of data being collected is ever increasing, and it is conceivable that near-unlimited microphone/camera access and storage could lead to this in the near future.

Either way, it isn't static or decreasing.





