
How to create an AI startup – convince some humans to be your training set - simplystats
http://simplystatistics.org/2016/03/30/humans-as-training-set/
======
AznHisoka
"It will also be interesting if there is a legal claim for the gig workers at
these companies to make that their labor helped “create the value” at the
companies that replace them."

Well, I would certainly hope any employee helps create value for the company
they work for... even if they get laid off eventually.

~~~
exolymph
The key word here is "legal". Without having a contract to that effect, no
employee or contractor can just appropriate equity, regardless of how much
sweat they put into building the company. I'm not sure why OP thinks there
might be a legal claim.

------
danblick
I think he's _really_ missing the point about the importance of self-play in
AlphaGo. Human play provided a seed for training the system, but the thing
that made it work was the fact that the computer could play an unlimited
number of games against itself; the fact that Go is a game with clear rules
made it possible to label a huge number of board positions without any human-
derived training data at all. The human-derived training set isn't nearly
enough for this.
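
A toy sketch of the labeling idea, not AlphaGo's actual pipeline: it assumes
tic-tac-toe instead of Go and a uniformly random policy instead of a trained
network, and every name in it (`self_play_game`, `generate_labeled_positions`)
is hypothetical. It only shows how clear rules let a program label every
position it visits with the final outcome, with no human games involved.

```python
import random

# All eight winning lines on a 3x3 board, as index triples.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def self_play_game(rng):
    """Play one game with random moves; label each position with the result."""
    board = ["."] * 9
    positions = []
    player = "X"
    while winner(board) is None and "." in board:
        move = rng.choice([i for i, cell in enumerate(board) if cell == "."])
        board[move] = player
        positions.append("".join(board))
        player = "O" if player == "X" else "X"
    result = winner(board) or "draw"  # the rules alone decide the label
    return [(pos, result) for pos in positions]


def generate_labeled_positions(n_games, seed=0):
    """Collect (position, outcome) pairs from n_games of self-play."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        data.extend(self_play_game(rng))
    return data


dataset = generate_labeled_positions(1000)
```

The size of `dataset` is limited only by how many games the program plays,
which is the contrast with a fixed set of human game records.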

~~~
nazka
Do you have first-hand sources for this? I have been hearing all sorts of
claims about what makes AlphaGo so great... First it was the hardware, then
it was the use of Monte Carlo tree search with NNs... And even more just 1
day ago:
[https://news.ycombinator.com/item?id=11382954](https://news.ycombinator.com/item?id=11382954)

~~~
danblick
"the AlphaGo algorithm, this is something we’re going to try in the next few
months — we think we could get rid of the supervised learning starting point
and just do it completely from self-play, literally starting from nothing."

[http://www.theverge.com/2016/3/10/11192774/demis-hassabis-in...](http://www.theverge.com/2016/3/10/11192774/demis-hassabis-interview-alphago-google-deepmind-ai)

(This is why I take exception to the claim in this blog post that the
supervised training data was critical to success...)

~~~
nazka
Ok thank you for your answer. So many things are claimed that it is hard to
track what is real and what is just hype.

I agree that it's something big. Training AlphaGo against itself means
something bigger than "just" optimizing a statistical model on human data. I
think everything from recognizing logical elements to planning strategy is
part of what will bring ML really close to AI (along with memory and the
cleverness to learn), and these are the next big steps.

------
morganK
I would have liked to hear at least one concrete example of a startup actually
doing that. It seems a bit theoretical at the moment, as big companies don't
need to do that thanks to existing datasets, and I've never heard of any
startups using dozens (hundreds?) of contractors for this kind of job.

~~~
tariqali34
Netflix used humans to tag movies for their recommendation system.

Source: [http://www.theatlantic.com/technology/archive/2014/01/how-ne...](http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/)

~~~
LunaSea
Netflix is not a startup.

~~~
true_religion
At one point, Netflix was a startup.

~~~
LunaSea
Yes, but it wasn't one in 2014 or 2012.

------
lifeisstillgood
This does hit on one of the most basic debates of the next decade: how many
of my actions and behaviours do I own? Creating a link from one page to
another, thus providing PageRank with value: do I get a cut of that value?
Purchasing a book or a film, thus making profit for the reseller's
recommendation engine? Driving around populating maps with my GPS
coordinates? Just generally leaving digital footprints makes someone a
training set somewhere, and yet instead of this being a public good it's
private profit. The term bandied around after 2008 was "socialising risk,
privatising profits". The same debate should be happening here, but I only
occasionally hear about something like it.

Or am I listening in the wrong places?

------
thinkingkong
It won't work this way in the short term.

Any company doing "AI" will get there over a long period of time by employing
people to do actual work and then slowly automating that work away. If you
wait for a huge dataset or some new technique there will be tons of
competition.

------
zodPod
>It will also be interesting if there is a legal claim for the gig workers at
these companies to make that their labor helped “create the value” at the
companies that replace them.

I'd assume that they'd be waiving any legal claim they might have when they
sign the ToS or whatever. I mean, in all fairness, they are getting paid to
perform these actions and be recorded. What more would they have any claim to
anyway? A percentage based on the number of times their anonymized
playthroughs were used?

"Well, we've got 1,000 people and each played 100 games of Go. We took that
100,000 games and trained a single model to play against itself." User is 1
player of 1,000. Company makes $20,000,000 and sets aside 25% (magically) for
paying back the original people. Those people now get $5,000 each. That
$5,000 is cool but it's not life changing.

EDIT: It occurs to me that my numbers could be skewed. This could be
significant if they only used 100 people or so, I guess. My point wasn't
necessarily to shoot down the notion, just to discuss it. What would the
person have a claim to, be it legal or otherwise?
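
The back-of-the-envelope numbers in this comment can be checked with a short
sketch. The figures come from the comment itself (20,000,000 in revenue, a
25% set-aside, 1,000 or 100 contributors); the function name
`payout_per_person` and the assumption of an equal split are illustrative
only.

```python
def payout_per_person(revenue, share, contributors):
    """Split a fixed share of revenue equally among contributors."""
    return revenue * share / contributors


# Figures from the comment: 1,000 people playing 100 games each.
total_games = 1_000 * 100

# The original scenario: 1,000 contributors.
payout_1000 = payout_per_person(20_000_000, 0.25, 1_000)

# The scenario from the EDIT: only 100 contributors.
payout_100 = payout_per_person(20_000_000, 0.25, 100)

print(total_games, payout_1000, payout_100)
```

With the same pool, shrinking the group from 1,000 to 100 raises each payout
tenfold, which is the skew the EDIT is pointing at.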

------
pbkhrv
Microsoft, perhaps inadvertently, did that. Tay's stream of consciousness can
now be used as a training set for an abusive content monitoring AI.

~~~
bliti
You could crawl 4chan and get a bigger dataset of abusive content. But that
could lead to terminators showing up on my lawn.

------
tariqali34
The interesting question is what would you call these humans who are serving
as your training set. Do you call them "Machine Therapists" (trying to coax
the AI to proper behavior)? "AI Educators" (providing the material that is
used to teach the AI)? "Data Scientists" (they are curating data and handing
it off to the machine)?

~~~
pdkl95
Hopefully they call them "people who gave their informed consent to use their
data in this specific AI project".

------
stcredzero
Searle would have us interpret this as the company taking the intelligence of
the humans, refining and repackaging it.

[https://www.youtube.com/watch?v=rHKwIYsPXLg](https://www.youtube.com/watch?v=rHKwIYsPXLg)

------
nxzero
Unclear how this is new. Even Google, Amazon, etc. have been doing this
internally, offering it as a service, releasing data, or falling prey to
man-in-the-middle exploits that mine real-world data for training sets.

~~~
awinter-py
Spot on. One recipe to become a tech acquisition target is to collect a
'new kind' of user data -- all big companies are hungry for this.

This phenomenon is not at all new; data has been informing investment models
forever and access to that data comes from having the right customers, and is
closely hoarded once gotten.

Some of the largest companies in the late middle ages were wool buyers -- they
weren't permitted to trade internationally, but they used locally owned
franchises and market knowledge to corner the market anyway. And many of the
largest ag commodities futures traders in this century also own substantial
farm acreage. Those Capital One guys who were SEC'd for trading options on
credit card receipts were leveraging customer activity.

Point being -- you've always needed data to train a good model.

------
graycat
With the many parameters, the _normal equations_ will become large. In that
case, one can consider solving the equations with the old iterative
Gauss-Seidel method.
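
A minimal sketch of that suggestion, assuming a small dense least-squares
problem in pure Python; the helper names `normal_equations` and
`gauss_seidel` and the toy data are hypothetical. It forms A = X^T X and
b = X^T y, then sweeps through the unknowns, reusing each updated value
immediately. Gauss-Seidel is known to converge when A is symmetric positive
definite, which holds here as long as the columns of X are linearly
independent.

```python
def normal_equations(X, y):
    """Build A = X^T X and b = X^T y for the least-squares problem."""
    n_params = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n_params)]
         for i in range(n_params)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n_params)]
    return A, b


def gauss_seidel(A, b, tol=1e-10, max_iter=10_000):
    """Solve A x = b by Gauss-Seidel sweeps, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            # Uses the newest values of x[j] for j < i within this sweep.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:  # stop once a full sweep barely moves x
            break
    return x


# Toy data lying exactly on y = 1 + 2*t (illustrative only).
X = [[1.0, t] for t in (0.0, 1.0, 2.0, 3.0)]
y = [1.0, 3.0, 5.0, 7.0]

A, b = normal_equations(X, y)
coeffs = gauss_seidel(A, b)
print(coeffs)  # intercept and slope, close to 1 and 2
```

Each sweep only touches one row of A at a time, which is why the method stays
attractive when the system is too large to factor directly.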

------
verbify
I've only a little experience in NN, but getting trainers is rarely the
bottleneck - it's usually in programming the NN.

------
tmaly
I plan to do just that, but my end goal is to provide a free service that has
tons of value for my users.

------
graycat
There's a chance that some Web site ad targeting is being done this way.

------
madelinecameron
This is kind of "no duh".

Not really an article that adds much value or understanding, especially for a
blog seemingly targeted at a technical audience.

