Hacker News
India’s data labellers are powering the global AI race (factordaily.com)
141 points by yarapavan 9 months ago | 34 comments



This is the new back office for AI processes, and it looks like India is tapping into it at just the right time, much like the BPO boom of the early 2000s.


Unsurprising, since they have a strong tech workforce with low labor costs and a language advantage (English). It will be interesting to see them move up the value chain, and the disruption that may create.


Unsupervised learning is the dark matter of AI, as LeCun said. We cannot rely on labeled data, which is expensive and scarce.

The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.

We have already seen impressive progress (language models, GANs...) but much more work remains. Models requiring labeled data or only working with toy data (even if unlabeled) will soon become irrelevant.


> The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.

I would expect that this will remain a per-task question for a very long time, maybe even forever. For example, I was amazed at how elegant https://openai.com/blog/unsupervised-sentiment-neuron/ is for an unsupervised model for sentiment analysis.

I think there are numerous similar applications in text understanding and I would not be surprised to eventually see many sota systems learn in an unsupervised fashion. Still, there remain many other tasks (in general, but also for text understanding) where this and similar strategies do not seem applicable to me -- or at least where I cannot directly imagine how they should be.

I would not be surprised if we end up with many, many successful strategies for applying unsupervised learning -- many of them completely different -- and with thousands of problems still being solved with the help of supervised learning, rather than with one "company to figure out" the general case.


>Unsupervised learning is the dark matter of AI, as LeCun said. We cannot rely on labeled data, which is expensive and scarce.

This sounds very extreme to me. Sure, unsupervised learning is already bearing fruit (most obviously in NLP) but there are domains where labeled data is unlikely to disappear soon - in many cases semi-supervised learning is a more reasonable goal which will reduce the need for labeling (or labeling-related tasks).

>The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.

This is an even stronger claim. Why are you so sure only one actor will be able to leverage it? Yes, it will be a huge boon but claiming that the first to crack it will 'win' seems very farfetched.


> Models requiring labeled data or only working with toy data (even if unlabeled) will soon become irrelevant.

Care to present evidence for that conclusion or better define the value of “soon”? Most work I am aware of in my own field still requires copious amounts of training data – even with the progress that you reference – and generalisation is still a big challenge.


> The first company to figure out how to leverage the massive unlabeled data that is available will win the AI arms race.

Well, in many tasks, this is already solved. E.g. speech recognition or translation. Or just think about language modeling.

The main problem is mostly just computing time and/or computing resources. We have enough supervised training data to train our model for weeks. We basically have an almost infinite amount of unsupervised data. So yes, we could in theory make the model bigger and then train it for months or years. But that's not really practical currently.


What's the current sentiment on synthetic data? It may be a self-serving question since I work at a synthetic data company, but I'm curious for fear of being stuck in an echo chamber.


It's no magic bullet. I think it can help in particular instances, but I see a lot of people chasing their own tails.


It's been very productive for us (in machine vision).


Unlabelled data is the death of humans. Let AI train on unlabelled data, and it will soon learn that humans are a net drain and need to be killed off.


Well, that's a really great initiative, as it gives good money to local villagers who are just educated enough to do this respectable office job. One concern is how they handle sensitive client data, and why clients trust them with sensitive proprietary data.


I agree that it is great to provide meaningful work and salaries that are competitive by local norms.

re: client trust: compare to systems like Mechanical Turk. An established data labeling company can monitor what employees are doing, provide ethics training/warnings, etc.


The question though is - do they get compensated appropriately for their efforts?


A lot of the time you end up using client-side tools, which do not require sharing data with these companies. You log into the client's system and do the annotation or labelling there without downloading the data.


There are captcha-defeating click farms in the country, employees of which are paid ~$2/day.

I hope data-set labelling employees are at least in a better position, as they are expected to have a better skill set. This is a better job than working at illegal/unethical farms.


Throwaway storyline: future AIs' perception is skewed by 'the' Indian perspective on the world. The first self-aware AI will feel and act Indian, whatever that means, and so will every other AI on a global scale :)


Late to the party. This article mentions AI-based labeling tools and we're building one of them. If you're interested in trying it out, send me an email: mik @ heartex.net


One of the earliest companies I know of in this field was Playment, a YC company that is also mentioned in the article:

https://news.ycombinator.com/item?id=13640084


I wonder how much of it grew out of a "human captcha solving" market?


Grew more from the medical transcription, call center, back office and general BPO market. Lots of literate people, a couple dollars an hour is good money - in these towns you’d actually support a family with entertainment and a maid.


No no, I think he might be right. India had a huge underground captcha-solving market at one point.


Huh? https://en.wikipedia.org/wiki/Business_process_outsourcing_t... According to the article, the BPO industry in India was $7-8B annually in FY06, employing millions of people. Do you have comparative numbers for the so-called "CAPTCHA solving industry"?


Nothing official but see

https://www.zdnet.com/article/inside-indias-captcha-solving-...

I guess it's not entirely contradictory as it comes under outsourcing haha..

Also, I'm involved a bit in making automated captcha systems for.. ahem.. research. I've noticed a lot of the manual service providers in China have already pivoted to data labeling services. Most of them probably already have the workers, infrastructure, and pipeline in place, which is why I think the original poster makes a good point.


Meanwhile Google reCAPTCHA makes us label cars and road signs for free..


In exchange, we are exposed to a minimal amount of comment spam. If it's anything like email, systems like captcha prevent 99.9% of spam messages.

Besides, how often do you see captchas anyway? If you're not using super privacy / tracking protective browsers, they'll remain hidden for the most part, and the ones you see are the simple 'check this box' variety.


Because of how Cloudflare works, and how ubiquitous they are, internet users from a non-western country specifically can get bombarded with reCAPTCHAs. It is so prevalent and annoying that I had to resort to using a VPN (located in a western country) to avoid this nuisance.

It literally breaks the internet for me; I had to go through reCAPTCHAs 12-15 times a day.


Yeah, because we use their service for free.


I've seen reCAPTCHA on services I pay for.


The services you pay for use Google anti-spam/bot-detection for free.


I would not take a bet that it didn't.


We (Humans) are helping our robot lord to build Skynet!


I, for one, welcome our new robot overlords.


I hope they have a sense of humor and drop some great Easter eggs for god 2.0



