
Ask HN: Why do so many startups claim machine learning is their long game? - crtlaltdel
I work with and speak to many startups. When I ask questions about product value, especially in the context of defensibility, they assert that their "long-term play is using machine learning on our data". This has been pretty consistent over the last few years, regardless of the nature of the product or the market it targets. It typically comes with assertions such as "data is the new oil" and "once we have our dataset and models the Big Tech shops will have no choice but to acquire us".

To me this feels a lot like the claims made by startups I've encountered in past tech hype cycles, such as IoT and blockchain. In both of those areas there seemed to be a pervasive sense of "if we build it, they will acquire".

The question I have for HN is in two parts:

1. Why is it that a lot of startups seem to be betting the farm, so to speak, on "machine learning" as their core value?

2. Is it reasonable to be highly skeptical of startups that make these claims, seeing it as a sign they have no real vision for their product?

[EDIT]: added missing pt. 2 of my question
======
peterwoerner
Because there is a real moat in data ownership and pipelines. If you want to
do any analysis, you quickly find that learning to properly use scikit-learn
and TensorFlow (or the machine learning library of your choice) is at least
an order of magnitude less work than getting the data. For instance, I
wanted to build a machine learning model that took simple data from SEC
10-Q and 10-K filings, which are freely available online, and predicted whether
stocks were likely to outperform the market average over the next 3 years.

Time to set up scikit-learn and TensorFlow models to make predictions: 4
hours. Time to write Python scripts that could parse through the Excel
spreadsheets, figure out which row corresponded to gross profit margin, and a
few other "standard" metrics: unknown, because I gave up after about 80 hours
of trying to figure out rules to process all the different spreadsheets and
how names were determined.

I had a professor who was doing machine learning + chemistry. He was building
up his own personal database for machine learning. He spent ~5 years with
about 500 computers building the database so that he would be able to do the
actual machine learning.

~~~
axegon_
So much this!!! In all fairness, it doesn't matter what you pick up: you'll
spend north of 80% of your time preparing and pre-processing data, which is
almost always catastrophically tedious and boring. Annoyingly, that also
applies to publicly available datasets; pre-processing is still most of the
work. For instance, the TensorFlow team invested a lot of time and effort into
tf.data for that reason, but IMO it doesn't make things a whole lot better.

~~~
geebee
I agree with you, though I actually personally don't find pre-processing data
tedious and boring. I kind of like knitting it all together.

On a side note... data pre-processing is often viewed as a side job that needs
to get done before the real work can begin. But I don't think I've ever been
able to prepare a data pipeline without making decisions about the data that
will impact the outcome. For example:

How do you deal with missing data? Interpolate, ignore, use averages, use a
machine learning algorithm to plug the gaps?
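
A minimal sketch of those options with pandas, on a toy series (the data here is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped      = s.dropna()          # ignore the gaps entirely
mean_filled  = s.fillna(s.mean())  # plug gaps with the column average
interpolated = s.interpolate()     # linear interpolation between neighbors

print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Each choice yields a different downstream dataset, which is exactly why these are modeling decisions rather than plumbing.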

How do you decide which data sources to include in the pipeline? What if one
data source seems more reliable, but another has far more data, too much to
use? Should you amplify one, sample from the other data sets to make the
volumes equal, or keep things proportional?
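
That trade-off can be made concrete with a toy example (source sizes are hypothetical): keep the raw proportions and let the big source dominate, or downsample it to match the smaller, more trusted one.

```python
import random

random.seed(0)
source_a = [("a", i) for i in range(100)]     # reliable but small
source_b = [("b", i) for i in range(10_000)]  # plentiful but noisier

proportional = source_a + source_b                            # B dominates
balanced     = source_a + random.sample(source_b, len(source_a))

print(len(proportional), len(balanced))  # 10100 200
```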

What if one data set changes more rapidly than another, how should you update
the data set used to populate the ML model?

These are just a few that jump into my head, and while there are techniques to
deal with them, ultimately, there isn't a correct answer. And honestly, even
these examples make it all seem more glamorous than it is, a lot of this is
just figuring out why various encoding and formatting errors are breaking the
feed, why column headers mysteriously change, why handwriting on form scans
gets properly translated into text some of the time and completely garbled in
others.

The funny thing is, I do see people fine-tuning ML algorithms (Kaggle style,
seeing if they can wring a bit more predictiveness out of a model), when in
real-world projects, decisions upstream about the data pipeline will have
an impact perhaps 5-10 times greater than any tweak to the ML parameters (or
even the choice of general algorithm). And yet, it's hard to get people to
even pay attention to these decisions, probably - as you said - because people
find it catastrophically tedious and boring.

~~~
emmanuel_1234
As a corollary, "Data Scientists" who can't program their way out of a paper
bag (e.g. write simple SQL, or a scraper in Python) are nearly useless, and a
strain on their peers who can.

I'd much rather hire a good programmer with some statistical knowledge than
the other way around.

~~~
dawg-
Either way, you are gonna be paying someone to fill in gaps in their
knowledge. Anyone with a laptop and an internet connection can learn to
"program their way out of a paper bag" in a weekend.

~~~
geebee
I kinda disagree, though I suppose it depends on the paper bag.

------
tlb
There are surely some startups for which this is bullshit. But the good
version of it is:

- take some valuable task that's never been successfully automated before

- do it manually (and expensively) for a while to acquire data

- build an automated system with some combination of regular software and ML
models trained on the data

- now you can do a valuable task for free

- scale up and profit

The risk is that it's hard to guess how much data you'll need to train an
accurate, automated model. Maybe it's very large, and you can't keep doing it
manually long enough to get there. Maybe it's very small and lots of companies
will automate the same task and you won't have any advantage.

I think there'll be some big successes with this model, and many failures. So
be skeptical -- ML isn't a magic bullet. But if a team has a good handle on
how they're going to automate something valuable, it can be a good bet.

As an investor, you may well face the situation down the line: "We've burned
through $10M doing it manually, and we're sure that with another $10M we can
finish the automation." Then you have to make a hard decision. With some
applications, like self-driving cars, it might be $10B.

~~~
buboard
This seems too easy a recipe to be worth it in the medium term - there is no
moat. Better cover your data with very strict laws, like Google does with its
exclusive deals for medical data use.

~~~
paulddraper
Google didn't add any "laws" around that data though.

The laws were already quite strict (HIPAA) and the data relatively
inaccessible.

Google jumped through all the hoops to get it. And an exclusive contract never
hurts no matter what the service. (Logistics, payment processor, etc.)

~~~
buboard
> an exclusive contract never hurts

That's why this is unfair. The health industry incentivizes (and often
mandates) open publishing of scientific results, but patient data is reserved
with gold chains for the exclusive use of Google.

------
smacktoward
You forgot the second part :-D

I think the answer is pretty simple: it sounds good, and it's hard to
challenge. It's essentially a promise that they will invent a black box with
magic inside. Since nobody can see inside the black box, it's hard to argue
that there isn't actually magic in it.

The long-term problem, of course, is that there aren't that many actual
magicians in the world. Most of the people who bill themselves as magicians
are either people who just _think_ they're magicians, or people who know they
aren't but don't mind lying about it.

~~~
crtlaltdel
Thanks! I have edited my post to account for my missing part2.

------
tetha
Ugh. Maybe I'm in a different world by now, but I dislike such statements on
multiple levels.

> This typically comes with assertions such as "data is the new oil" and "once
> we have our dataset and models the Big Tech shops will have no choice but to
> acquire us".

Maybe it's me, but I dislike the attitude of working to be acquired.
Interestingly, this is a rift I see quite a bit when I interview more
development-oriented versus more infrastructure-oriented people. Tell me I'm
wrong, but development-oriented people tend to be faster paced and care less
about long-term impacts. Infra-inclined people tend to be slower paced, but
longer-term oriented: build something to last and generate value for a long
time.

> 2. Is it reasonable to be highly skeptical of startups that make these
> claims, seeing it as a sign they have no real vision for their product?

From my B2B experience over the last few years, working towards stable
business relationships with large European enterprises: yes. My current
workplace is moving into the position of becoming a cutting-edge provider in
our part of the world. This is the point where machine learning and AI become
interesting.

However, we didn't get here by fancy models and AI. We got here by providing
good customer support, rock-solid SaaS operation, delivering the right
features, strong professional services, and none of those features were AI.
It's been good, reliable grunt work.

Different forms of AI are currently becoming relevant to our customers,
because we have customers that handle 5k tickets per day with our systems and
they have 3-4 people just classifying tickets non-stop. We have customers with
30k - 40k articles in their knowledge base, partially redundant, partially
conflicting.

This is why we entered a relationship with a university researching natural
language processing, among other things - and they will provide us with a big
selling point in the future. They are profiting from this relationship as
well, because they are getting large, real-world data sets they couldn't
access otherwise, even after a good amount of pre-processing by the different
product teams.

But as I maintain, nothing of that form has brought us where we are.

~~~
JohnFen
> I dislike the attitude to work to be acquired.

This is my preferred method of doing business. I start a company with the
intention of selling it to someone else in the end. I do this because it fits
my personality the best -- once I have solved a technical problem, I lose
interest in it and want to move on to the next thing.

However, there's a right way and a wrong way to do business with this sort of
goal, and your comment here puts the finger on the difference:

> Build something to last and generate value for a long time.

This is an essential part of what I need to do in order to sell. It's not time
to sell my company until the company is already profitable and set to work
over the long term. That's where the real value proposition is for buyers --
I'm not selling a technology or piece of software, I'm selling an established
business.

In my opinion, people who start businesses with the intent to dump them at the
soonest opportunity are really just doing the same thing as people who make
money flipping stocks. There's nothing wrong with that (it's just not for me),
but it's a completely different sort of thing. It's money-spinning without the
intention to build anything of lasting value.

------
thecolorblue
I agree with other points here, but I also think nobody wants to wake up 2
years from now and be the only company that was not investing in machine
learning. It could turn into nothing, or it could be 100x the time and money
put in now.

So, of the 4 outcomes:

(1) buy in now + worthless

(2) buy in now + 100x

(3) don't buy in + worthless

(4) don't buy in + 100x

(4) is a terrible position to be in, (2) is a good one, (1) only costs a
little, and (3) is even. So when you are deciding whether to buy in or not,
you are choosing between good + a little loss and terrible + even.
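
One way to see the shape of that argument is to put illustrative numbers on the four outcomes (the costs and payoffs here are made up, not from the thread):

```python
# Hypothetical units: buying in costs 1, the upside is worth 100,
# and sitting out a real 100x shift means being left behind.
cost, payoff = 1.0, 100.0

outcomes = {
    ("buy",  "worthless"): -cost,          # (1) small loss
    ("buy",  "100x"):      payoff - cost,  # (2) big win
    ("skip", "worthless"): 0.0,            # (3) even
    ("skip", "100x"):      -payoff,        # (4) terrible
}

def expected_value(decision, p):
    """Expected payoff given probability p that ML turns out to matter."""
    return (p * outcomes[(decision, "100x")]
            + (1 - p) * outcomes[(decision, "worthless")])

# With these numbers, buying in wins once p exceeds cost / (2 * payoff) = 0.5%.
for p in (0.001, 0.01, 0.1):
    print(p, expected_value("buy", p) > expected_value("skip", p))
```

Under this toy model, even a modest chance of ML mattering justifies the buy-in.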

~~~
nikanj
Sounds like a new version of Pascal’s Wager

~~~
quickthrower2
John Carmack said exactly that in the FB post

------
jacquesm
Because they believe it will increase their chances of getting funding. The
way ML is dragged in by the hair in some propositions really is just painful.
The funny thing is that the few start-ups I've seen that actually needed and
used ML properly did so quietly, because it gave them a huge edge over
competitors who had not yet clued in to that fact.

------
floatingatoll
Machine learning is a cost reduction multiplier for staffing costs, which are
perhaps the highest cost for technology firms until they become successful.

Google and Facebook use algorithms to try and minimize dollars spent on
moderating their sites, with limited success. If they can avoid paying human
beings to make judgement calls, they save billions of dollars a year.

It’s reasonable to be skeptical of startups who claim that they’ll use ML
someday. Ask them how they’re using human labor today to perform the tasks
that they’d like ML to do, and what their runway is for that work at their
current burn rate. If they don’t deliver a viable ML labor reduction by that
time, they will either collapse or worsen their service.

------
odomojuli
The real moneymaker is what's under the hood of many machine learning models:
data laundering and labor externalization.

Consider this: a good deal of open datasets are stolen. People had their faces
taken without consent. The data was tagged and labelled for pennies per image
on some server farm. Wrapping it up in machine learning can effectively black-
box what it is you're actually doing.

------
joddystreet
Software products can be layered into 3 parts:

- data collection (frontend, APIs)

- data storage (database, backend)

- data visualization (dashboards, analytics, reports)

To bootstrap a startup, the primary people you need are a frontend person, a
backend person, and a product person. In the beginning, you may dream about
the possibility of using ML, but you would not invest in hiring a data
scientist at that stage. The second stage is to hand over the reins to
operations people and let them optimize the internal and external processes
and get ready for the launch (growth). These phases can take up any amount of
time, from 2 to 5 years. Finally, you hand the reins over to the salespeople
and go back from product to services mode (especially true for enterprise
software). During this stage, after around 6-8 years of bootstrapping the
company, you would be hiring a data scientist and a team of data analysts.
Unless your product relies heavily on an ML algorithm, it can and will always
wait.

Figuring out where to first use ML is another challenge. Hiring a data
scientist and asking them to tell you what to do is a futile effort.

ML is useful tech, and you have to keep thinking about how to utilize it to
improve your product and processes. It has to be a part of your long-term
efforts.

------
warrenronsiek
Most companies aren't interesting enough to invest in without some kind of
secret sauce. Claiming that they are going to use 'AI and ML' is a way of
saying that 'yes, we are a company with tons of competition and little
competitive advantage, but we will have a secret sauce eventually. We don't
know what that is yet, so we are using jargon as a placeholder.'

Call me a cynic, but I think the talk of building a data moat is mostly
nonsense. Of the companies that make these claims, how many are actually
hiring expensive data engineers to build the moat? If they don't have a team
working on it, then it's a ruse.

Unless the startup is founded by people with deep experience in ML who have
been using it extensively from day 0, it is unlikely that they will be able to
deliver on this vision. They won't be collecting the data correctly, if they
are collecting it at all. If they are collecting it correctly (they aren't),
they won't be able to get it into a usable form. If they get that far (they
won't), they then need to build out the ML Ops to deliver their models. Now,
finally, they can `from tensorflow import *`. Engaging this process post hoc
takes YEARS.

------
yters
This reminds me of Altman's justification of his AI venture. Once he solves
AGI, then all the monies is his, and whoever else was lucky enough to invest
in his 'lightcone of the future'.

It seems AGI is the perpetual motion machine or philosopher's stone of our
era.

Enough so that I'd consider starting a hedge fund to short all the AGI
companies :)

------
OnlineCourage
Because rule based software is now a commodity, and therefore has lower
margin. ML is more risky to achieve, may not work on a given dataset, and
requires scarce specialized thinking, therefore is worth betting capital on.
ML is the new stock while software is the new bond.

------
winrid
Because they are probably temporarily using mechanical turks. If they admit
that, their valuations will be terrible, because that kind of company cannot
scale like a SaaS company - it'll be valued like Professional Services, which
is "not good".

So if you say you're going the ML route your perceived value is much higher.

------
paulcole
The key is that they’re not betting the farm on ML. Instead they’re betting
_someone else’s_ farm on ML.

------
solidasparagus
It is one of the two obvious economies of scale for pure software companies
(along with network effects). I can't think of a company that should not have
ML in their long-term plan.

Some people probably talk about it without truly understanding it (it can take
years before you have enough data to build good ML models, particularly if
there are seasonal trends), but I certainly wouldn't judge a company poorly
for seeing ML as an important long-term moat. I would judge a young company
that sees it as a short-term moat - if you can acquire valuable data quickly,
so can your competitors and it's not really an economy of scale that gives you
defensibility.

------
kgiddens1
The reason this is the case is that one of the biggest gaps in the market is
not the technology itself (most of which is open source, or variations
thereof) but proprietary data sets. These are valuable when transformed (as my
company does at www.edgecase.ai ) into annotated data. What we see is that
companies that a) invest in acquiring data, b) transform said data, and c)
build a model that is useful to customers (and acquires their data) are the
way of the future.

"Ai with unique datasets is an amazing moat"

So in sum:

1) In some cases this is true (but most of the time there is no unique data).

2) If they truly have unique data, this is something to take notice of.

------
LoSboccacc
Because VC money is hunting for AI startups. So everyone does what they did
when it was data analysis, predictive intelligence, augmented reality, and
whatnot: they attach the buzzwords to the pitch in the hope of getting a foot
in.

------
ebrewste
Ideally, ML does two things:

- It makes meaning from data (this is customer value).

- It insulates the raw data from competitors. The customer gets their
actionable insights from your algorithms, and your competitors can't run
algorithms on your raw data and race to the bottom with you.

This works in two scenarios: 1) lifestyle businesses where it isn't worth it
for would-be competitors to generate their own data, and 2) big projects where
the first mover gets an unfair advantage from huge data sets.

------
grumpy8
My take on it is that you can add so much more value to a product with ML, but
to succeed you need a lot of data.

So you get into a loop of "we need to grow really fast to get more data than
our competitors, so we can add ML and create more value in the product, so we
can grow faster and get ahead of our competitors".

I.e., ML is a competitive advantage, which is often hard to come by for
startups.

------
gsich
From my experience if someone says they are doing such things it usually boils
down to:

Machine learning == statistics

Deep learning == machine learning

------
rubyfan
1. I believe the term is "hand waving".

2. Yes, or maybe worse: they have no business model.

------
sys_64738
It’s the buzz phrase that requires being part of every corporate’s mission.
Even my trash company is into machine learning.

------
buboard
It's like the gold rush, except the gold (data) is easy to make and very, very
cheap. Hmm, better sell shovels.

~~~
stereolambda
Yeah, people say that, but it's hard to see what selling shovels would amount
to. I mean, hardware/cloud is commoditized, ML software is free. Maybe Uber
for annotators /s (read, Mechanical Turk).

(Of course, there is the ever more popular route of making services for
enterprises wanting to outsource basic stuff.)

~~~
mchen076
Realistically the "shovels" are the compute services. And the shovel makers
are really doing well.

What you don't really see is the ML startups making billions yet. What you do
however see is companies like Microsoft with their Azure product and Amazon
with AWS Compute making bank off the tech startups doing ML. I really think
the people who are cashing in the most with the ML craze are the cloud/compute
providers.

I would be curious to see just how much VC money is being passed indirectly to
MS Azure through ML startups.

~~~
buboard
> the cloud/compute providers

And Nvidia

------
RocketSyntax
Aggregate some kind of data during your normal business operations. Then
produce insight from that data.

------
sjg007
Collect and sell the data is a strategy.

------
1_over_n
because VCs believe it gives them defensibility

------
copperfitting
Strongly agree!

------
codesushi42
_Why is it that a lot of startups seem to be betting the farm, so to speak, on
"machine learning" as their core value_

Because it is the current overhyped thing, and there are enough dumb investors
to fund it.

