The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.
We have already seen impressive progress (language models, GANs...) but much more work remains. Models requiring labeled data or only working with toy data (even if unlabeled) will soon become irrelevant.
I would expect that this will remain a per-task question for a very long time, maybe even forever. For example, I was amazed at how elegant https://openai.com/blog/unsupervised-sentiment-neuron/ is as an unsupervised approach to sentiment analysis.
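The general recipe behind that result is easy to sketch: use the hidden state of an unsupervised language model as a feature vector and fit a tiny L1-regularized linear probe on the handful of labels you have. A minimal sketch below -- lm_hidden_state is a hypothetical placeholder for a real pretrained LM, and the data is made up:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def lm_hidden_state(text: str) -> np.ndarray:
        """Hypothetical placeholder: return a pretrained LM's final hidden state."""
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=4096)  # stand-in for a real learned representation

    # Only this tiny probe ever sees labels; the LM itself trained on raw text.
    texts = ["great movie", "terrible plot", "loved it", "waste of time"]
    labels = [1, 0, 1, 0]

    X = np.stack([lm_hidden_state(t) for t in texts])
    probe = LogisticRegression(penalty="l1", solver="liblinear").fit(X, labels)

    # In the OpenAI result, the L1 penalty drove most weights to zero and a
    # single LM unit ended up carrying most of the sentiment signal.
    print("nonzero probe weights:", np.count_nonzero(probe.coef_))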
I think there are numerous similar applications in text understanding, and I would not be surprised to eventually see many SOTA systems learn in an unsupervised fashion.
Still, there remain many other tasks (in general, but also for text understanding) where this and similar strategies do not seem applicable to me -- or at least where I cannot directly imagine how they should be.
I would not be surprised if we end up with many, many successful strategies for applying unsupervised learning -- many of them completely different -- and with thousands of problems still being solved with the help of supervised learning. Not with a single "company to figure out" a general case.
This sounds very extreme to me. Sure, unsupervised learning is already bearing fruit (most obviously in NLP), but there are domains where labeled data is unlikely to disappear soon - in many cases semi-supervised learning is a more reasonable goal, one that will reduce the need for labeling (or labeling-related tasks).
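To make "semi-supervised" concrete: the simplest recipe is self-training (pseudo-labeling) -- fit on the small labeled set, pseudo-label the unlabeled pool where the model is confident, and refit. A minimal sketch on synthetic data (the threshold and set sizes are arbitrary choices for illustration, not a recommendation):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_lab, y_lab = X[:50], y[:50]  # small labeled set
    X_unlab = X[50:]               # large pool whose labels we pretend not to have

    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    for _ in range(5):  # a few self-training rounds
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) > 0.95  # keep only confident pseudo-labels
        X_aug = np.vstack([X_lab, X_unlab[confident]])
        y_aug = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

    print("pseudo-labeled examples used:", int(confident.sum()))

scikit-learn ships this loop as sklearn.semi_supervised.SelfTrainingClassifier if you'd rather not hand-roll it.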
>The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.
This is an even stronger claim. Why are you so sure only one actor will be able to leverage it? Yes, it will be a huge boon, but claiming that the first to crack it will 'win' seems very far-fetched.
Care to present evidence for that conclusion or better define the value of “soon”? Most work I am aware of in my own field still requires copious amounts of training data – even with the progress that you reference – and generalisation is still a big challenge.
Well, in many tasks, this is already solved. E.g. speech recognition or translation. Or just think about language modeling.
The main problem is mostly just computing time and/or computing resources. We have enough supervised training data to train our model for weeks. We basically have an almost infinite amount of unsupervised data. So yes, we could in theory make the model bigger and then train it for months or years. But that's not really practical currently.
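For anyone unsure why language modeling counts as unsupervised: the training signal is just the next token of the raw text itself, so any unlabeled corpus is training data. A toy bigram version makes this obvious:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    # Every adjacent word pair in the raw text is a (context, target) example;
    # no human labeling happens anywhere.
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def p_next(prev: str, nxt: str) -> float:
        """Maximum-likelihood bigram probability P(nxt | prev)."""
        total = sum(counts[prev].values())
        return counts[prev][nxt] / total if total else 0.0

    print(p_next("the", "cat"))  # 0.25 -- "the" is followed by cat/mat/dog/rug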
re: client trust: compare to systems like Mechanical Turk. An established data labeling company can monitor what employees are doing, provide ethics training/warnings, etc.
I hope at least data-set labelling employees are in a better position, as they are expected to have a better skill set. It is a better job than working for illegal/unethical farms.
I guess it's not entirely contradictory as it comes under outsourcing haha..
Also, I'm involved a bit in making automated captcha-solving systems for... ahem... research. I've noticed a lot of the manual service providers in China have already pivoted to data labeling services. Most of them probably already have the workers, infrastructure, and pipeline in place, which is why I think the original poster makes a good point.
Besides, how often do you see captchas anyway? If you're not using super privacy / tracking protective browsers, they'll remain hidden for the most part, and the ones you see are the simple 'check this box' variety.
It literally breaks the internet for me; I have to go through reCAPTCHAs 12-15 times a day.