
The Machine Learning Race Is Really a Data Race - prostoalex
https://sloanreview.mit.edu/article/the-machine-learning-race-is-really-a-data-race/
======
makewavesnotwar
It's incredible. The last company I worked for before going it alone (I was
the front-end engineer and moved away from ML-based business models) was
trying to automate statistical analysis. I came from an academic background in
economics, and when I tried to propose modelling to them in simple terms they
jumped to, "it sounds like you're talking about forecasting, let's go with
that." Then they started trying to implement Python models from academic papers
with highly limited training data, and my reaction was generally: WTF?! You
can't even start to forecast without a reliable base model. But they went
ahead trying to sell stuff like churn prediction to companies with zero
understanding of how these models work at the most basic level.

And yeah, Google started to throw their hat in the game with Analytics 360 and
an enormously larger training base. Amazon's another major player.

Weirdly enough, people do still blindly pay my previous employers to
figure stuff out, because easy answers are always actionable even if they're
wrong. It's just crazy: the CEO explained to me that lying about the
service to potential customers and investors was necessary because "faking it
til you make it" was a sound business principle in his mind, as if 1980s
Michael J. Fox were his primary source of business advice.

Long story short, don't waste your time with these little companies purporting
ML holy grails. They're probably just lying to you, whether intentionally or
not. ML is a game for the big boys with access to market level aggregates. The
models that last company came up with were wildly inaccurate.

~~~
ThePhysicist
I only partially agree. Building good ML models and even outperforming the ML
services of the big players is absolutely feasible. Have e.g. a look at this
talk from PyCon DE (in English:
[https://www.youtube.com/watch?v=XniwzOCWi2c](https://www.youtube.com/watch?v=XniwzOCWi2c)),
which shows how a small team built a machine vision system to read car
registration numbers from official documents. The system was built and trained
with an extremely small dataset (I think around 60 scanned documents with some
data augmentation) and was able to easily beat the Google Cloud ML algorithm
by an impressive margin (Google ML had an intolerably high error rate for this
seemingly simple problem).
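The small-dataset trick mentioned above, data augmentation, is easy to sketch. The talk's actual pipeline isn't reproduced here, so this is a hypothetical minimal version in plain Python: each scanned document (a grid of grayscale pixel values) gets jittered and noised to multiply ~60 labeled documents into hundreds of training samples. All names and parameters are illustrative, not from the talk.

```python
import random

def augment(image, n_variants=10, seed=0):
    """Generate noisy variants of one grayscale image (a list of rows of
    0-255 pixel values), multiplying a tiny labeled dataset."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        shift = rng.randint(-2, 2)     # small horizontal jitter
        gain = rng.uniform(0.9, 1.1)   # brightness variation
        new = []
        for row in image:
            shifted = row[-shift:] + row[:-shift] if shift else row[:]
            noisy = [min(255, max(0, int(p * gain + rng.gauss(0, 5))))
                     for p in shifted]
            new.append(noisy)
        variants.append(new)
    return variants

# 60 "scanned documents" x 10 variants each = 600 training samples
docs = [[[128] * 32 for _ in range(32)] for _ in range(60)]
augmented = [v for d in docs for v in augment(d)]
```

The labels come along for free: every variant inherits the label of the original scan, which is what makes augmentation so cheap for a small team.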

So I'd say if you have a very specific area that you're investigating you have
a very good chance of beating larger players that don't specialize as much as
you can. Of course competing against Google in self-driving cars or machine
translation might be a bad idea, but even in those areas there are small
startups that produce impressive results (e.g. DeepL:
[https://www.deepl.com/en/translator](https://www.deepl.com/en/translator)).
Also, big companies regularly exaggerate their capabilities as well (sometimes
more than startups), just have a look at how IBM markets their Watson AI/ML
solutions, and what they deliver in reality.

So personally I'd say it has never been that easy to build relevant and
interesting ML/AI based solutions as a small team, and it is possible to beat
large players if you have the right approach and the right (very narrow)
problem.

~~~
paganel
DeepL is a very promising thing. I was very sceptical about the future of
automatic translation, seeing as Google Translate seems to have stagnated for
the last two years or so, but I recently tried DeepL on a German newspaper
article and it did a very good job. Granted, I don't know German (hence why I
used DeepL), but the English translation provided by DeepL nevertheless seemed
more polished than what Google Translate usually produces.

~~~
snovv_crash
I've used it a fair amount, and continue to be amazed at the quality it puts
out. There are still some issues with formal pronouns, subject-matter-specific
contractions, etc., but otherwise it does a great job with both EN->DE and
DE->EN.

------
ohthehugemanate
Data quantity and quality are key. Both, though. This is why it's foolish to
go up against an ML product from Google, Facebook, or (maybe) Microsoft. You
just can't compete with the volume and quality of data they can access.

In sectors like automotive, where every brand is competing to try and build
the best predictors using only their own data, there is a huge opportunity
available for the first two companies to share data with each other. Doubling
your data quantity brings a significant improvement to any model, and would
put them ahead of any competition. That advantage only grows the more players
you add to the sharing pool.

I believe that if humanity is really going to harness machine learning, the
concept of a bulk data commons is an inevitable requirement.

This makes decentralized personal data control, homomorphic encryption, and
similar technologies incredibly important.

~~~
visarga
> This is why it's foolish to go up against an ML product from Google,
> Facebook, or (maybe) Microsoft.

I was sure someone would say this. I think it only applies to advertising,
commerce and insurance. When it comes to training ML models on images, text
and games, big corporations don't have a unique trove of data. They only have
a data advantage when it comes to personal data.

The more important advantage big corporations have is hiring the best ML
scientists and engineers, and more of them. Demand for these people far
outstrips the supply.

~~~
nostrademons
ImageNet has 14 million images. How many images does Google have available to
them? They've got about 1000+ of my personal photos backed up on Google
Photos, so if each of their 200M monthly active users is like me, that's 200
_billion_ images from Google Photos alone. Then add every image crawled from
the WWW for image search (that was several tens of billions when I was there,
each tagged with text & structured data from its page of origin), every Street
View photo (that was thousands of hard drives worth, measuring in the
petabytes total), and high-res satellite images of the entire earth through
their SkyBox/TerraBella acquisition.

The Big Tech companies like to understate the size of their data advantage
because it tempts competitors into doing stupid things that won't work in the
market. Don't be fooled though - the majority of useful data is locked up
inside proprietary silos.

~~~
laichzeit0
Sure, they have those images, but so what? It's all unlabeled data.

I think the labeled datasets we create for them with e.g. reCAPTCHA are far
more useful for training.

~~~
shostack
Are they unlabeled though? If location sharing metadata exists, that's a
decent chunk of location-tagged photos of places to have.

------
DATACOMMANDER
The article is decent, but I see two mistakes: the author assumes that AI/ML
can’t produce unique insights from public data; and she conflates AI/ML with
automation. While it’s true that you won’t find anything that your competitors
haven’t if you use the same data _and the same AI /ML techniques_, there’s
nothing stopping companies from differentiating on the techniques in addition
to (or even instead of) the data. If you just use plug-and-play AI, then sure,
you’ll need a unique data set if you want unique results.

The section about finding faster, less error-prone ways to apply existing
insights sounds more like automation than AI. There’s certainly overlap, but
they’re two different things.

------
Xcelerate
It's a data race because we've run up against another wall on the algorithms
side. Find a technique that works better than GBDT for the same type of
problem. Other than some minor tweaks described in the academic literature,
it's been a while since something really advanced the state of the art.

Small datasets still have massive predictive potential; we just need better
algorithms. (As an extreme example, suppose I give you the first 30 digits of
pi or e and ask you to predict what comes next. Despite being a small amount
of data of low algorithmic complexity, machine learning cannot currently
handle this type of problem.)
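To make the pi example concrete, here is how the digit-prediction task would be framed as a standard supervised problem (sliding-window features, next-digit targets), which also shows why it defeats pattern-matching learners. This framing is a sketch of mine, not from any particular paper:

```python
# Frame "predict the next digit of pi" as supervised learning:
# features = a sliding window of digits, target = the digit that follows.
PI_30 = "314159265358979323846264338327"

def windows(digits, width=5):
    X, y = [], []
    for i in range(len(digits) - width):
        X.append([int(d) for d in digits[i:i + width]])
        y.append(int(digits[i + width]))
    return X, y

X, y = windows(PI_30)
# 25 training pairs, e.g. [3, 1, 4, 1, 5] -> 9. A conventional learner fit
# on these pairs can only memorize them: the digits are statistically
# indistinguishable from random noise, so there is no pattern to generalize,
# even though a short *program* (a spigot algorithm) generates them exactly.
```

That gap between low algorithmic complexity and zero statistical signal is exactly the wall the parent comment is pointing at.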

~~~
ydj
The pi and e example is more complicated than it looks. If you ask a human
who doesn't know about pi or e, how much effort would it take for them to
figure out the next digits? It seems like they'd have to rediscover the math
first (or, I suppose, perform a Google search).

~~~
montenegrohugo
Yes, it would be a hugely complicated undertaking and probably impossible for
most humans with little academic mathematical knowledge. But the point is that
it _would_ be possible, which indicates that the problem does not necessarily
lie in the amount of data but in the algorithmic approach itself.

ML is a great tool that is creating very real and tangible value, but it still
has a ways to go. Just adding more computational capability and more data will
only bring marginal improvements.

------
gesman
THIS.

I was just saying to our partner (as well as to my wife!) how lucky we are to
work on healthcare solutions. We have access to data about medications,
opioids, and patient and physician behavior that very few others (at least
among those with any clue about data analytics) have.

Sitting on a goldmine of otherwise impossible-to-access data, plus the
capability to develop cutting-edge analytics solutions that could change the
world, is the best place to be.

~~~
tonyhb
Google partners with the NHS to get medical data.

------
yonkshi
The sample efficiency of ML systems is increasing rapidly in the theoretical
realm, but the available data is growing orders of magnitude faster than
progress in ML sample efficiency. The article has a valid point: it's far
cheaper to hoard data than to invent a more sample-efficient ML system, so we
will see people race toward data rather than technical complexity.

~~~
fooker
>The sample efficiency of NN are increasing rapidly in the theoretical realm.

Can you provide some references for this?

~~~
yonkshi
Sure. I don't have a holistic survey to prove my point, but an example of
recent progress in terms of sample efficiency is this paper[0]. Derivatives of
this paper have been used to tackle Sudoku[1], StarCraft II[2] and more[3].
The paper enabled more efficient use of data by creating a probabilistic
graphical model between logical sets.

[0][https://arxiv.org/abs/1706.01427](https://arxiv.org/abs/1706.01427)
[1][https://arxiv.org/abs/1711.08028](https://arxiv.org/abs/1711.08028)
[2][https://arxiv.org/abs/1806.01830](https://arxiv.org/abs/1806.01830)
[3][https://arxiv.org/abs/1806.01822](https://arxiv.org/abs/1806.01822)

~~~
state_less
I like the 3D datasets in these papers since, like a scientist in a lab, you
can set up the experiment and explore the domain. Adding time would be cool
too (e.g. are the blue and red balls going to collide in 10 seconds?).

It also helps to be able to show that you can answer some of these questions
in principle with your model. It gives you hope that it might be able to cover
real-world images.

------
minimaxir
The neat thing about big data is that there are massive diminishing returns on
the amount of data vs. model quality.

What's more important is data _quality_ (e.g. structured and unbiased), and an
incumbent with the right approach can have a strong impact.
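A toy illustration of those diminishing returns: on many benchmarks, test error falls roughly as a power law in dataset size, so each doubling of the data buys a smaller absolute improvement. The exponent below is invented for the example, not measured from any real benchmark:

```python
def power_law_error(n, a=1.0, b=0.3):
    """Toy learning curve: test error ~ a * n^(-b)."""
    return a * n ** -b

sizes = [10_000 * 2 ** k for k in range(5)]   # 10k, 20k, ..., 160k samples
errors = [power_law_error(n) for n in sizes]
gains = [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]

# Each doubling of the dataset buys a smaller absolute error reduction:
assert all(gains[i] > gains[i + 1] for i in range(len(gains) - 1))
```

Under a curve like this, going from 10k to 20k samples helps far more than going from 80k to 160k, which is why quality and feature coverage eventually matter more than raw volume.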

~~~
dumbfoundded
It's not just quality but also which features are available in the data.
Quality is just a signal-to-noise problem: if you have enough data, you can
generally segment it to get higher quality. Obtaining features not present in
other data sets is probably the most significant factor.

For example, say you want to build a speech recognition engine and need 15K
hours of data to build/validate a model. How would you get that? You could
farm it out to people on Mechanical Turk and get 15K hours of audio
transcribed. With enough money, you could duplicate the transcriptions enough
times to be pretty confident about the quality of the data set. If you're
clever and have a large enough dataset, segmentation generally gets you decent
quality. The big gains come from features others don't have. For example,
Google realized that when you build a speech recognition engine, you can
include video data and image processing to use the way people move their
mouths to significantly increase the quality of an automated transcription.
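The "duplicate the transcriptions" idea is essentially redundant labeling with a majority vote. A minimal sketch (the function and sample data are hypothetical, not any crowdsourcing platform's API):

```python
from collections import Counter

def aggregate(labels_per_item):
    """Majority vote over redundant labels for each item; with enough
    independent labelers, occasional transcription errors get outvoted."""
    merged = []
    for labels in labels_per_item:
        winner, count = Counter(labels).most_common(1)[0]
        merged.append((winner, round(count / len(labels), 2)))  # label + agreement
    return merged

# Three labelers transcribe the same two audio clips:
votes = [["hello world", "hello world", "hello word"],
         ["good morning", "good morning", "good morning"]]
labels = aggregate(votes)
# -> [("hello world", 0.67), ("good morning", 1.0)]
```

The agreement score doubles as a quality signal: low-agreement items can be re-routed for more labels, which is the segmentation-for-quality step the comment describes.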

~~~
state_less
I think creative new ways of harvesting data will continue to be profitable.

That's fine with today's mindset, but over time it seems like we'll want to go
beyond big data and think about how a child can become quite capable without
seeing millions of instances of people crossing a road, etc. We have to make
our machines build models and 'want' to gather data to test them. By doing the
work of today's data scientists, such machines may well put the data
scientists out of business, and quickly everyone else too.

------
autokad
as a data scientist, it has always been about the data. it has always been a
data race. google has known that, which is why they spent billions trying to
protect their search moat.

~~~
anongraddebt
This. All things being equal, and assuming similar data quality among
competitors, the one with the most data has the stronger strategic position.
That's because, all things being equal, more data means a larger number of
available choices and insights (I'm making some simplifying assumptions here).

Business strategy is fundamentally about trade-offs, though, so we need a
caveat: more choices and more insights can, at times, be a weakness. You
always have to prioritize, and as data volume grows, the ability to prioritize
well doesn't necessarily grow with it.

------
synaesthesisx
Mark my words: organizations like Amazon, Google, etc. will soon start
offering a data marketplace (intended for both buyers and sellers of all sorts
of "alternative" data, everything from small-business metrics to
enterprise/B2B and anything in between). The next logical step would be to
offer insights/models as a service as a layer built on top of this.

I've been studying their moves carefully and have no doubt this is where
they're headed. While I don't work for Amazon, I think AWS in particular is
uniquely positioned for these next couple of moves, to a level that may make
it challenging for others to compete. Interesting times...

~~~
joe_the_user
Indeed, the clever thing would be to have an API flexible enough to let a
user bring their own ML architecture and training process while still keeping
that user from accessing the raw data.

The ideal approach would somehow allow this anonymized access to multiple
large databases simultaneously. I don't know how you'd do that, but if you
claimed Ethereum would help, you'd arrive at buzzword nirvana even if you were
wrong.

------
sirwitti
Personally, I'm not 100% convinced that the current ML/AI approaches (that
I'm aware of) will yield big new steps forward.

Neuromorphic computing could have that potential, I think, if we can build
hardware that's good enough.

~~~
visarga
This year brought us huge improvements in handling text (BERT), speech
(Google TTS, especially the Allo demo), images (ProGAN, StyleGAN, BigGAN) and
activities such as games (AlphaZero) and robotics. Even music composition is
improving a lot. I don't feel it is slowing down yet. And when it eventually
slows down, that will be all for the better: we will have time to revisit all
those different approaches that have been more or less ignored because of the
DL hype.

I think the key fields in AI will become simulator-based learning (RL) and
graph-processing neural nets, because graphs can express any kind of
high-dimensional data and are useful for reasoning tasks. They marry the
symbolic and connectionist approaches. These two subdomains have evolved
rapidly over the last couple of years. They also address the data problem: in
simulation you can produce as much data as you want, and graphs have
combinatorial generalisation, so they work on new configurations without
retraining.

------
lordnacho
It kinda can't be any other way though.

What are models? They're a way to describe the data. So this is where a load
of philosophical stuff like Occam's Razor comes in and favours things like
having fewer degrees of freedom, lower errors, etc.

What is data? It's what makes one model a more likely explanation than
another.

You can make up any number of ways to describe some phenomenon, but without
data there's no way to tell which of them is better. Or rather, you will fall
back on some model with fewer specifics, because of those considerations we
mentioned earlier.

So getting smarter (ie more complex) with models can't help on its own.

~~~
joe_the_user
Human beings can be smart using a fairly small set of data. Humans have an
enormous store of data to work with, but when confronted with a bit of new
data on a new subject, they can sometimes do very well.

This is a result of the ML paradigm essentially starting fresh with whatever
data set it is trying to "solve", but the paradigm doesn't have to be that
way.

~~~
PeterisP
IMHO pretty much every example I've seen of humans getting good results from a
fairly small set of data actually involves successful generalization/transfer
of "generic life experience" or "generic audiovisual processing" to the new
problem, rather than actual learning from limited data. We know how to do that
in ML, in general: we can do transfer learning from few examples quite well if
the underlying generic data is good enough. However, we currently don't have
underlying generic data good enough to match the years of life experience that
any human kid has accumulated.

A human learning to play an Atari game that involves an agent jumping over a
pit has to learn only the mechanics of that game, and that can be done in
minutes. Learning that game _from scratch_, on the other hand, also requires
learning to interpret vision and the whole concept that the world contains
objects that may move around, which takes months of learning for the human
brain. So comparing sample efficiency is an apples-to-elephants comparison if
we disregard the ability to reuse/transfer knowledge from related tasks that
all humans learn during childhood.

------
karmasimida
Models are cheap, data is HARD.

Plus, investing in data is much more predictable: the outcome always gets
better. The margin will be diminishing, but better is better.

Modeling is not: hiring 100 machine learning 'experts' will not solve the
problem ten times better than hiring 10 of them. On the other hand, 100
labelers will reliably provide ten times the throughput of 10.

~~~
aqme28
> though the margin will be diminishing

So by the metric of model performance (the only thing that matters here), what
you're saying is that hiring more labelers is actually not linear.

~~~
karmasimida
True, of course. We can only go as high as 100% accuracy, right?

However, labelers are scalable in terms of throughput and coverage of the
data: you can always find bad examples or holes in your current data plane,
and that is when the labelers, not the scientists, are going to rescue you.

------
mslate
No, it's a process race. The data is just the most visible "asset" in play.

The entities that own the best data mining process will win. This includes
data collection + ETL/storage + model training/deployment.

Obtaining a vertical monopoly on this process is the goal.

------
nuclx
"Machines need a lot more data than humans do in order to get smart" \- is
that true?

~~~
Doubleslash
Yes. On average the human brain needs just dozens or hundreds of examples
(turns, exercises, you name it) to "learn" something. A decent machine
learning model needs hundreds of thousands to millions of samples to gain good
confidence, and can still be fooled easily afterwards with subtle changes.

~~~
m0zg
Actually, I don't think this is strictly speaking accurate. In order to learn
from those dozens/hundreds of samples, the human brain first needs to be
developed by accumulating experience from billions of samples in related
domains. This "pre-training" process literally takes years, and it still does
not sufficiently prepare some people for some tasks.

~~~
sbov
I just think machines are good at some things, and people are good at others.
If it really took billions of samples in related domains we wouldn't develop
nearly as fast as we do after being born.

~~~
m0zg
I'm not sure I'd call it "fast" either. For the first three months babies can
barely see anything, and for at least nine months they can't form anything
even remotely resembling speech and can't walk. What's amazing is that all
this learning is very sparsely supervised and all the "subsystems" train at
the same time.

~~~
pdimitar
This is a human peculiarity. Many animal babies are born with fully developed
abilities. We have all seen the NatGeo videos of baby antelopes stumbling 2-3
times and then immediately starting to walk and even run.

Human babies have a bigger head-to-body ratio than any other species because
our brains are bigger. Our babies have to be born earlier, or they couldn't
make it out of the mother alive.

Outside of that, we develop pretty quickly. As you pointed out, everything in
us develops in parallel, which is quite impressive.

------
miguelrochefort
What will happen is that we will move toward an agent-centric system like
Holochain and put all the data about our environment there.

Only when people realize that transparency is better than privacy will they
start putting all their quantified-self data on there.

This will become a decentralized artificial intelligence network, from which
consciousness and AGI will emerge.

------
craftinator
I don't understand how this is a race at all. There is no way to finish it,
and "going fast" is likely to cause major issues in developing sound
technologies. Changing the article's attitude to a more scientifically sound
position would increase its relevance.

------
mark_l_watson
I have just started working with Ocean Protocol
([https://oceanprotocol.com/](https://oceanprotocol.com/)) and I have set up a
local meetup in January for local entrepreneurs looking for access to machine
learning data.

------
miguelrochefort
Don't they have more than enough data by now? What are they waiting for?

------
buboard
Data is cheap, though, and crowdsourcing is a very effective way to get it.
The differentiation may not last long.

------
DrNuke
There are reliable ways to generate unbiased, synthetic, even big data for
almost any industrial domain out there, plus a lot of applied research fields
in the STEM curricula. The most important issue nowadays is accountability of
results (i.e. ablation studies), after too many false positives and a number
of recent, blatant scams.

------
zozbot123
Clearly, the solution is to develop a machine learning framework in Rust -
Rust is 100% immune to data races!

