Hacker News
Ask HN: Why do so many startups claim machine learning is their long game?
149 points by crtlaltdel on Nov 13, 2019 | 74 comments
I work with and speak to many startups. When I ask questions around the product value, especially in context of defensibility, they assert that their "long term play is using machine learning on our data". This has been pretty consistent in the last few years, regardless of the nature of the product or the market for which the product is targeted. This typically comes with assertions such as "data is the new oil" and "once we have our dataset and models the Big Tech shops will have no choice but to acquire us".

To me this feels a lot like the claims made by startups I've encountered in past tech-hype-cycles, such as IoT and blockchain. In both of these areas there seemed to be a pervasive sense of "if we build it, they will acquire".

The question I have for HN is in two parts:

1. Why is it that a lot of startups seem to be betting the farm, so to speak, on "machine learning" as their core value?

2. Is it reasonable to be highly skeptical of startups that make these claims, seeing it as a sign they have no real vision for their product?

[EDIT]: add missing pt2 of my question

Because there is a real moat in data ownership and pipelines. If you want to do any analysis, you quickly find that learning to properly use scikit-learn and tensorflow (or the machine learning library of your choice) is at least an order of magnitude less work than getting the data. For instance, I wanted to build a machine learning model which took simple data from SEC-filed 10-Qs and 10-Ks, which are freely available online, and predicted whether stocks were likely to outperform the market average over the next 3 years.

Time to set up scikit-learn and tensorflow models to make predictions: 4 hours. Time to set up Python scripts which could parse through the Excel spreadsheets, figure out which row corresponded to gross profit margin, and a few other "standard" metrics: ??? Unknown, because I gave up after about 80 hours of trying to figure out rules to process all the different spreadsheets and how names were determined.
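To give a flavor of why the parsing side dominates: even a toy normalizer for line-item names needs a synonym table, because filings rarely agree on what to call the same metric. A minimal sketch (all labels and synonyms below are hypothetical):

```python
import re

# Hypothetical synonym table: different filings label the same line item differently.
SYNONYMS = {
    "gross profit margin": {"gross profit margin", "gross margin", "gross margin %"},
    "net income": {"net income", "net earnings", "net income (loss)"},
}

def normalize(label):
    """Lowercase and strip everything except letters, %, parentheses, and spaces."""
    return re.sub(r"[^a-z%() ]", "", label.lower()).strip()

def canonical_metric(row_label):
    """Map a raw spreadsheet row label to a canonical metric name, if known."""
    norm = normalize(row_label)
    for metric, variants in SYNONYMS.items():
        if norm in variants:
            return metric
    return None  # the long tail of unrecognized labels is where the 80 hours go
```

The table itself is the hard part: every unmapped variant means going back to a spreadsheet by hand.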

I had a professor who was doing machine learning + chemistry. He was building up his own personal database for machine learning. He spent ~5 years with about 500 computers building the database so that he could do the actual machine learning.

So much this!!! In all fairness, it doesn't matter what you pick up, you'll spend north of 80% of your time preparing and pre-processing data, which is almost always catastrophically tedious and boring. Annoyingly that also applies to publicly available datasets - pre-processing is still most of the work. For instance, the tensorflow team invested a lot of time and effort into tf.data for that reason but imo it doesn't make things a whole lot better.

I agree with you, though I actually personally don't find pre-processing data tedious and boring. I kind of like knitting it all together.

On a side note... data pre-processing is often viewed as a side job that needs to get done before the real work can begin. I don't think I've ever been able to prepare a data pipeline without making decisions about the data that will impact the outcome. For example:

How do you deal with missing data? Interpolate, ignore, use averages, use a machine learning algorithm to plug the gaps?

How do you decide which data sources to include in the pipeline? What if one data source seems more reliable, but another has far more data, too much to use? Should you amplify it, sample from the other data sets to make the volumes equal, or keep things proportional?

What if one data set changes more rapidly than another, how should you update the data set used to populate the ML model?

These are just a few that jump into my head, and while there are techniques to deal with them, ultimately, there isn't a correct answer. And honestly, even these examples make it all seem more glamorous than it is, a lot of this is just figuring out why various encoding and formatting errors are breaking the feed, why column headers mysteriously change, why handwriting on form scans gets properly translated into text some of the time and completely garbled in others.
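The missing-data question above is easy to make concrete. A toy sketch (the values are made up) shows that the choice of imputation strategy changes the statistics any downstream model sees:

```python
values = [10.0, None, 30.0, 40.0]

def impute(series, strategy):
    """Fill gaps in a list of floats; None marks missing values."""
    known = [v for v in series if v is not None]
    if strategy == "drop":
        return known
    if strategy == "mean":
        fill = sum(known) / len(known)
        return [v if v is not None else fill for v in series]
    if strategy == "interpolate":
        # naive: fill each interior gap with the average of its neighbors
        out = list(series)
        for i, v in enumerate(out):
            if v is None:
                out[i] = (out[i - 1] + out[i + 1]) / 2
        return out
    raise ValueError(strategy)

def mean(xs):
    return sum(xs) / len(xs)

# Dropping the row vs. interpolating disagree before any model even runs:
# mean of "drop" is ~26.67, mean of "interpolate" is 25.0.
```

None of the three choices is "correct"; they just bake different assumptions into everything downstream.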

The funny thing is, I do see people fine tuning ML algorithms (kaggle style, seeing if they can wring a bit more predictiveness out of a model), when in the real world projects, decisions upstream about the data pipeline will have an impact perhaps 5-10 times greater than any tweak to the ML parameters (or even which general algorithm to choose). And yet, it's hard to get people to even pay attention to these decisions, probably - as you said - because people find it catastrophically tedious and boring.

As a corollary, "Data Scientists" who can't program their way out of a paper bag (e.g. write simple SQL, or a scraper in Python) are nearly useless, and a strain on their peers who can.

I'd much rather hire a good programmer with some statistical knowledge than the other way around.

I've seen so much bad math and misunderstood statistics hard coded into widely-implemented software that that's not ideal either.

Ultimately, you need a team with a combination of strengths if your product requires multidisciplinary work. Otherwise you end up with hilariously wrong equations/assumptions in your code base. But, full disclosure, my background is math/science.

Either way, you are gonna be paying someone to fill in gaps in their knowledge. Anyone with a laptop and an internet connection can learn to "program their way out of a paper bag" in a weekend.

I kinda disagree, though I suppose it depends on the paper bag.

Yep, I actually started a company that cleans data reliably because of this. It's usually a huge waste of resources for a highly paid person (like an ML specialist) to clean data, and most data cleaning companies just outsource to overseas resources that aren't reliable or knowledgeable beyond an Excel skill set.

> data ownership and pipelines

not ML.

By your argument, the money is in data hoarding and brokering, and renting that data (with DRM) to ML outfits, not doing ML.

Anyway, cleaning dirty data isn't exceptionally hard, it's just boring work; getting the raw data is the hard part.

This is an interesting observation that deserves to be addressed.

I think in practice, what happens is that there's no easy transactional boundary that can keep the data ownership and ML in separate firms. The theory of the firm [1] states that firms arise when market transaction costs exceed the economic inefficiencies of centralized resource allocation. There are some pretty heavy transaction costs to doing data cleaning and data science in separate organizations:

1.) DRM for datasets isn't really a thing, since to explore, visualize, and train on them, you need access to the raw data, and then instead of your machine-learning function you can just pass the identity function to get the raw data. DRM for consumers always relies on a publisher whose incentive is to stay in business (by not breaking any laws) rather than to obtain the raw data.

2.) You don't know whether a cleaned data set will be useful for your ML application until you've inspected it, visualized it, run some statistics over it, etc. at which point you've done most of the work for setting up your models. That means there's a big risk premium for buying data, and potential buyers usually want samples & statistics before committing a lot of money.

3.) At the price that good datasets go for, you're usually dealing with enterprise sales, which involves commissioned salespeople, face-to-face meetings, expensive dinners out, etc.

4.) The type of labeling you need to do is often intimately connected with the usage of the ML model. What constitutes spam? What constitutes abuse? These are questions for your policy team, which the data-collection organization would have no visibility into and no way to set up a one-size-fits-all policy that all potential customers would be okay with.

5.) Software markets tend to be winner-take-all, which means you're dealing with a handful of customers, and one will likely become dominant and able to acquire you.
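Point 1 above can be sketched in a few lines (the dataset and API are invented for illustration): any "protected" endpoint that evaluates an arbitrary client-supplied function over raw rows can simply be handed the identity function.

```python
# Invented "secret" dataset behind a hypothetical compute-to-the-data API.
SECRET_DATASET = [{"user": "alice", "age": 34}, {"user": "bob", "age": 29}]

def run_model(model_fn):
    """Hypothetical 'DRM' endpoint: applies the client's function to each row."""
    return [model_fn(row) for row in SECRET_DATASET]

# A legitimate client extracts features...
ages = run_model(lambda row: row["age"])

# ...while a malicious client passes the identity function and gets everything.
leaked = run_model(lambda row: row)
assert leaked == SECRET_DATASET
```

Blocking this requires restricting what functions clients may run, at which point the service is no longer general-purpose ML.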

Instead of separate firms for data ownership, ML, and user interface, usually the service that makes the final product will end up collecting or buying the data outright, hire data-scientists to do the ML, and then offer a product. The end result, economically, is what you say: the money is in data hoarding. But it isn't apparent in prices of a sustainable customer ecosystem, it's apparent in acquisition prices for startups with data vs. salaries of data scientists.

[1] https://en.wikipedia.org/wiki/The_Nature_of_the_Firm

> DRM for datasets isn't really a thing, since to explore, visualize, and train on them, you need access to the raw data, and then instead of your machine-learning function you can just pass the identity function to get the raw data.

Functional Encryption [1] is aiming at this exact problem. It's still at a really early stage, and it currently has a lot of caveats (the current algorithms are really expensive and only able to work with linear functions, AFAIK), but the field is moving rapidly, and we don't know how far it will get. There's also Fully Homomorphic Encryption [2], which is way cheaper and more versatile, but in that case, only the owner of the data can get the result of the calculation.

None of this is currently used at scale, but there is a lot of research in this field, and some PoCs are being built at Microsoft and IBM, IIRC. There's also Cosmian [3], a French startup working on this topic. (Full disclosure: I know the founders.)

[1]: https://en.m.wikipedia.org/wiki/Functional_encryption

[2]: https://en.m.wikipedia.org/wiki/Homomorphic_encryption#Fully...

[3]: https://cosmian.com/

> easy transactional boundary that can keep the data ownership and ML in separate firms

There is. Data broking is a well-established business model in finance. Every bank and hedge fund is running ML models on data licensed from a provider such as Reuters or S&P; there are dozens of such brokers.

Um, yes and no. Some of the data annotation problems get really hard and really expensive. Look at radiology or genomics. You literally need N people to die for more knowledge in some of these situations.

Data cleaning and data labeling are two separate tasks, with vastly different workflows. For example, getting structured data out of Whois requires a lot of cleanup. Labeling domains as “bad” might take forever.
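A sketch of the cleaning half (the raw record below is invented; real registrar output varies far more): turning loosely formatted key/value text into a structured dict is mechanical, while deciding which domains to label "bad" is a judgment call no parser makes for you.

```python
# Invented Whois-style snippet; field casing and spacing vary between registrars.
raw = """Domain Name: EXAMPLE.COM
   Creation Date : 1995-08-14T04:00:00Z
Registrar:   Example Registrar, Inc."""

def parse_whois(text):
    """Normalize 'Key: value' lines into a dict with snake_case keys."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # split on the first colon only, so timestamps survive intact
        key, _, value = line.partition(":")
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record
```

The labeling workflow has no equivalent shortcut: it needs a policy for what "bad" means, and people (or models) applying it consistently.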

Agree 100% - the single largest problem with applying ML to anything is getting (1) enough (2) accurate (3) representative (4) correctly labeled data to train and validate the models.

Everything else is noise. For any proposed ML project, the first question you should ask is "Where are you going to get the training data?"

If someone tells you "Oh, getting data will be the easy part", back away very slowly :)

Strongly agree with this.

One thing I'll mention is that this is true both at the very early stages of an ML project and even when an ML project is scaled up and in production. Oftentimes, the data pipeline is the true way in which a model will improve versus anything else, so it's pretty critical that these data pipelines are set up to get an initial dataset but also to scale properly.

It’s one reason I started Scale (scale.com). It was viscerally clear that the real bottleneck to ML was getting the needed data, and in our case, annotating that data appropriately. It is very heartening to hear it echoed in this whole thread that data is very clearly what “matters” for ML.

You should've bought the processed 10q and 10k data ranging many years for all US stocks. I bet you can get that for under $1000.

This is the most basic securities data that exists in finance, perhaps you can even get it for free from some Yahoo or other retail data source.

Yes, but I was in graduate school making $1500/mo and was doing it mostly for fun as a class project. I was also very naive about how much effort it would take to process that data. Both Google and Yahoo have current data which is easy to get, but not historical. Google didn't really help me find what I was looking for.

>Time to setup python scripts which could parse through the excel spreadsheets, figure out which row corresponded to gross profit margin, and a few other "standard" metrics: ???

If you just want basic financial data there are easier ways, e.g. the Quandl API (not affiliated).

But the bigger issue is that you can't accurately label your training data. A subsequent stock price move could have been due to something revealed in the 10-Q that the market somehow missed, or it could have been caused by a forest fire in Bolivia, or a competitor getting acquired, or any combination of near-infinite external events that couldn't have been predicted from the 10-Q.

Strongly agreed. In my daily job, the exciting, shiny machine learning work is trivial compared to the sheer keyboard-smashing of having to parse millions of rows of Excel sheets with incomprehensible organization and column naming.

So this. The data problem is being controlled, fought, and negotiated at the level of nation-states and indeed whole meta-national cultures.

There are surely some startups for which this is bullshit. But the good version of it is:

- take some valuable task that's never been successfully automated before

- do it manually (and expensively) for a while to acquire data

- build an automated system with some combination of regular software and ML models trained on the data

- now you can do a valuable task for free

- scale up and profit

The risk is that it's hard to guess how much data you'll need to train an accurate, automated model. Maybe it's very large, and you can't keep doing it manually long enough to get there. Maybe it's very small and lots of companies will automate the same task and you won't have any advantage.

I think there'll be some big successes with this model, and many failures. So be skeptical -- ML isn't a magic bullet. But if a team has a good handle on how they're going to automate something valuable, it can be a good bet.

As an investor, you may well face the situation down the line "We've burned through $10M doing it manually, and we're sure that with another $10M we can finish the automation." Then you have to make a hard decision. With some applications like self-driving cars, it might be $10B.

> The risk is that it's hard to guess how much data you'll need to train an accurate, automated model. Maybe it's very large, and you can't keep doing it manually long enough to get there. Maybe it's very small and lots of companies will automate the same task and you won't have any advantage.

I think you forgot the most important option: it may not be a data problem at all. You may have all the data in the world and still not be able to solve the issue.

The capabilities of ML are much more limited than the hype makes people believe.

I can't think of a company that wouldn't have valuable data. A core use of ML is reducing costs by making better spending decisions/reducing waste and that is relevant to almost every company I think.

Recurring expenses (i.e. the kind that will generate enough data to adequately train a ML system) are already handled pretty well by most businesses through traditional (non-ML) methods. Operations Research has been a thing for 60+ years. Businesses that don't handle their recurring expenses well now are likely to have organizational issues (e.g. senior management that ignores the advice of their reports on how to do things) that will prevent them from doing so even if ML is added to the mix.

The place where businesses get in real trouble (and hence would see significant value from better decision making) is when doing things that they have _not_ done 10K+ times before. Things like "Should we expand into $NEW_MARKET?". ML isn't going to help them with that, because there will not be any useful historical data to train them with.

I completely disagree with the assumption that because something has been done for 60+ years it won't be improved by new, extremely relevant technologies. Amazon is very effective at both OR and organizational dynamics, but they found massive savings by using ML to predict demand and thus inventory/costs (Research is here - https://arxiv.org/pdf/1711.11053.pdf).

Unless I missed it, the referenced paper doesn't quantify what savings Amazon achieved, or even if Amazon actually used the NN described. It does not support the statement "found massive savings".

No, that was from hearing them present their work.

But I do think the GEFCom2014 Electricity Forecasting benchmark is pretty clear proof that ML solutions such as these can improve expenditure decision making compared to established techniques.

This seems like too easy a recipe to be worth much in the medium term - there is no moat. Better to cover your data with very strict laws, like Google does with its exclusive deals for medical data use.

Like you say it's the access to data itself that is valuable. At risk of oversimplifying: building models is the easy part. Plenty of smart people who can do that.

Data that is expensive to acquire is the best long-term play for an ML company. Either expensive due to regulations or expensive due to the quality of sources. Ideally both.

Google didn't add any "laws" around that data though.

The laws were already quite strict (HIPAA) and the data relatively inaccessible.

Google jumped through all the hoops to get it. And an exclusive contract never hurts no matter what the service. (Logistics, payment processor, etc.)

> an exclusive contract never hurts

That's why this is unfair. The health industry incentivizes (and often mandates) open publishing of scientific results, but patient data is reserved with gold chains for the exclusive use of Google.

There's another risk - which is rare events. Basically most of the reason why these workflows are not automated already is that there are a host of "once a week per operative" corner cases that don't show in the data for a long time, and when they do show they don't show as anything but noise. But this is where the human intelligence is spent and this is why the jobs are so hard to automate.

$10B won't be nearly enough to build level 4+ autonomous vehicle control software that will work in a wide variety of roads and weather. Probably too low by an order of magnitude or two.

You forgot the second part :-D

I think the answer is pretty simple: it sounds good, and it's hard to challenge. It's essentially a promise that they will invent a black box with magic inside. Since nobody can see inside the black box, it's hard to argue that there isn't actually magic in it.

The long-term problem, of course, is that there aren't that many actual magicians in the world. Most of the people who bill themselves as magicians are either people who just think they're magicians, or people who know they aren't but don't mind lying about it.

Thanks! I have edited my post to account for my missing part2.

Ugh. Maybe I'm in a different world by now, but I dislike such statements on multiple levels.

> This typically comes with assertions such as "data is the new oil" and "once we have our dataset and models the Big Tech shops will have no choice but to acquire us".

Maybe it's me, but I dislike the attitude to work to be acquired. Interestingly, this is a rift I see quite a bit if I interview more development oriented guys, or more infrastructure oriented guys. Tell me I'm wrong, but development oriented guys tend to be more fast paced and care less about long-term impacts. Infra inclined guys tend to be slower paced, but longer-term oriented. Build something to last and generate value for a long time.

> 2. Is it reasonable to be highly skeptical of startups that make these claims, seeing it as a sign they have no real vision for their product?

From my B2B experience over the last few years, working towards stable business relationships with large European enterprises: yes. My current workplace is moving into the position of becoming a cutting-edge provider in our part of the world. This is the point where machine learning and AI become interesting.

However, we didn't get here by fancy models and AI. We got here by providing good customer support, rock-solid SaaS operation, delivering the right features, strong professional services, and none of those features were AI. It's been good, reliable grunt work.

Different forms of AI are currently becoming relevant to our customers, because we have customers that handle 5k tickets per day with our systems and they have 3-4 people just classifying tickets non-stop. We have customers with 30k - 40k articles in their knowledge base, partially redundant, partially conflicting.

This is why we entered a relationship with a university researching natural language processing, among other things - and they will provide us with a big selling point in the future. And they are profiting from this relationship as well, because they are getting large, real-world data sets they couldn't access otherwise, even with a good amount of pre-processing by the different product teams.

But as I maintain, nothing of that form has brought us where we are.

> I dislike the attitude to work to be acquired.

This is my preferred method of doing business. I start a company with the intention of selling it to someone else in the end. I do this because it fits my personality the best -- once I have solved a technical problem, I lose interest in it and want to move on to the next thing.

However, there's a right way and a wrong way to do business with this sort of goal, and your comment here puts the finger on the difference:

> Build something to last and generate value for a long time.

This is an essential part of what I need to do in order to sell. It's not time to sell my company until the company is already profitable and set to work over the long term. That's where the real value proposition is for buyers -- I'm not selling a technology or piece of software, I'm selling an established business.

In my opinion, people who start businesses with the intent to dump them at the soonest opportunity are really just doing the same thing as people who make money flipping stocks. There's nothing wrong with that (it's just not for me), but it's a completely different sort of thing. It's money-spinning without the intention to build anything of lasting value.

I am really surprised seeing posts like yours downvoted

I agree with other points here, but I also think nobody wants to wake up 2 years from now and be the only company that was not investing in machine learning. It could turn into nothing, or it could be 100x the time and money put in now.

So, of the 4 outcomes:

(1) buy in now + worthless

(2) buy in now + 100x

(3) don't buy in + worthless

(4) don't buy in + 100x

(4) is a terrible position to be in, (2) is a good position to be in, (1) only costs a little, and (3) is even. So when you are deciding whether or not to buy in, you are deciding between good + small loss or terrible + even.

Sounds like a new version of Pascal’s Wager

John Carmack said exactly that in the FB post

By that logic you should definitly buy lottery tickets.

Yes you should, if the payoff is high enough or the odds are good enough.
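That's just expected value. A two-line sketch (all probabilities and payoffs are invented) shows where the lottery analogy breaks down:

```python
def expected_value(p_win, payoff, cost):
    """Expected profit of a bet: win probability times payoff, minus the stake."""
    return p_win * payoff - cost

# A typical lottery ticket: ~1-in-300M odds at a $100M prize for $2. Negative EV.
lottery = expected_value(1 / 300_000_000, 100_000_000, 2)

# A hypothetical ML bet: a 5% shot at a 100x return on a unit stake. Positive EV.
ml_bet = expected_value(0.05, 100, 1)
```

The whole argument hinges on whether the win probability you plug in is remotely honest, which is exactly what the Pascal's Wager framing skips.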

Because they believe it will increase their chances of getting funding. The way ML is dragged in by the hairs in some propositions really is just painful. The funny thing is the few start-ups that I've seen that actually needed and used ML properly did so quietly because it gave them a huge edge over their competition who had not yet clued in to that fact.

Machine learning is a cost reduction multiplier for staffing costs, which are perhaps the highest cost for technology firms until they become successful.

Google and Facebook use algorithms to try and minimize dollars spent on moderating their sites, with limited success. If they can avoid paying human beings to make judgement calls, they save billions of dollars a year.

It’s reasonable to be skeptical of startups who claim that they’ll use ML someday. Ask them how they’re using human labor today to perform the tasks that they’d like ML to do, and what their runway is for that work at their current burn rate. If they don’t deliver a viable ML labor reduction by that time, they will either collapse or worsen their service.
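The runway question is simple arithmetic (the numbers below are invented for illustration):

```python
# Hypothetical startup: cash on hand and monthly burn, including the human
# labor the ML system is supposed to replace.
cash = 2_000_000        # dollars in the bank
monthly_burn = 150_000  # dollars per month

# Roughly 13 months to ship a viable ML replacement for the manual work.
runway_months = cash / monthly_burn
```

If the model isn't ready inside that window, the options reduce to raising more money, collapsing, or cutting the human labor anyway and degrading the service.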

The real moneymaker is what's under the hood of many machine learning models: data laundering and labor externalization.

Consider this: a good deal of open datasets are stolen. People had their faces taken without consent. The data was tagged and labelled for pennies per image on some server farm. Wrapping it up in machine learning can effectively black-box what it is you're actually doing.

Software products can be layered into 3 parts -

- data collection (frontend, APIs)

- data storage (database, backend)

- data visualization (dashboards, analytics, reports)

To bootstrap a startup, the primary people you need are a frontend person, a backend person, and a product person. In the beginning, you might dream about the possibility of using ML, but you would not invest in hiring a data scientist at that stage. The second stage is to hand the reins over to operations people and let them optimize the internal and external processes and get ready for the launch (growth). These phases can take up any amount of time, from 2 to 5 years. Finally, you hand the reins to the salespeople and go back from product to services mode (especially true for enterprise software). During this stage, you would be hiring a data scientist and a team of data analysts, after around 6-8 years of bootstrapping a company. Unless your product relies heavily on an ML algorithm, it can and will always wait.

Figuring out where to first use ML is another challenge. Hiring a data scientist and asking them to tell you what to do is a futile effort.

ML is a useful technology, and you have to keep thinking about utilizing it to improve your product and processes. It has to be a part of your long-term efforts.

Most companies aren't interesting enough to invest in without some kind of secret sauce. Claiming that they are going to use 'AI and ML' is a way of saying that 'yes, we are a company with tons of competition and little competitive advantage, but we will have a secret sauce eventually. We don't know what that is yet, so we are using jargon as a placeholder.'

Call me a cynic, but I think the talk of building a data moat is mostly nonsense. Of the companies that make these claims, how many are actually hiring expensive data engineers to build the moat? If they don't have a team working on it, then it's a ruse.

Unless the startup is founded by people with deep experience in ML who have been working on using it extensively from day 0, it is unlikely that they will be able to deliver on this vision. They won't be collecting the data correctly, if they are collecting it at all. If they are collecting it correctly (they aren't), they won't be able to get it into a usable form. If they get that far (they won't), they then need to build out the ML Ops to deliver their models. Now, finally, they can `from tensorflow import *`. Engaging in this process post hoc takes YEARS.

This reminds me of Altman's justification for his AI venture: once he solves AGI, then all the money is his, and whoever else was lucky enough to invest in his 'lightcone of the future'.

It seems AGI is the perpetual energy device or philosopher's stone of our era.

Enough so that I'd consider starting a hedge fund to short all the AGI companies :)

Because rule based software is now a commodity, and therefore has lower margin. ML is more risky to achieve, may not work on a given dataset, and requires scarce specialized thinking, therefore is worth betting capital on. ML is the new stock while software is the new bond.

Because they are probably temporarily using mechanical turks. If they admit that, their valuations will be terrible, because that kind of company cannot scale like a SaaS company - it'll be valued like Professional Services, which is "not good".

So if you say you're going the ML route your perceived value is much higher.

The key is that they’re not betting the farm on ML. Instead they’re betting someone else’s farm on ML.

It is one of the two obvious economies of scale for pure software companies (along with network effects). I can't think of a company that should not have ML in their long-term plan.

Some people probably talk about it without truly understanding it (it can take years before you have enough data to build good ML models, particularly if there are seasonal trends), but I certainly wouldn't judge a company poorly for seeing ML as an important long-term moat. I would judge a young company that sees it as a short-term moat - if you can acquire valuable data quickly, so can your competitors and it's not really an economy of scale that gives you defensibility.

The reason this is the case is that one of the biggest gaps in the market is not the technology itself (most of this is open source or variations thereof) but proprietary data sets. These are valuable when transformed (as my company does at www.edgecase.ai) into annotated data. What we see is that companies that a) invest in acquiring data, b) transform said data, and c) build a model that is useful to customers (and acquires their data) are the way of the future.

"Ai with unique datasets is an amazing moat"

So in sum:

1) In some cases this is true (but most of the time there is no unique data).

2) If they truly have unique data, this is something to take notice of.

Because VC money is hunting for AI startups. So everyone does what they did when it was data analysis, predictive intelligence, augmented reality, and whatnot: they attach the buzzwords to the pitch in the hope of getting a foot in the door.

ML does two things, ideally:

- It makes meaning from data (this is customer value).

- It insulates the raw data from competitors. The customer gets their actionable insights from your algorithms, and your competitors can't run algorithms on your raw data, trying to start a race to the bottom with you.

This works in two scenarios: 1) lifestyle businesses, where it isn't worth it for would-be competitors to generate their own data, and 2) big projects, where the first mover gets an unfair advantage from huge data sets.

My take on it is you can add so much more value to a product with ML, but to succeed you need to have a lot of data.

So you're getting into a "We need to grow really fast to get more data than our competitors so we can add ML and create more value to the product so we can grow faster and get ahead of competitors".

I.e. ML is a competitive advantage which is often hard to come by for startups.

From my experience, if someone says they are doing such things, it usually boils down to:

Machine learning == statistics

Deep learning == machine learning

1. I believe the term is "hand waving".

2. Yes, or maybe worse: they have no business model.

It’s the buzz phrase that requires being part of every corporate’s mission. Even my trash company is into machine learning.

It's like the gold rush, except the gold (data) is easy to make and very, very cheap. Hmm, better sell shovels.

Yeah, people say that, but it's hard to see what selling shovels would amount to. I mean, hardware/cloud is commoditized, ML software is free. Maybe Uber for annotators /s (read, Mechanical Turk).

(Of course, there is the ever more popular route of making services for enterprises wanting to outsource basic stuff.)

Realistically the "shovels" are the compute services. And the shovel makers are really doing well.

What you don't really see is the ML startups making billions yet. What you do however see is companies like Microsoft with their Azure product and Amazon with AWS Compute making bank off the tech startups doing ML. I really think the people who are cashing in the most with the ML craze are the cloud/compute providers.

I would be curious to see just how much VC money is being passed indirectly to MS Azure through ML startups.

> the cloud/compute providers

And Nvidia

I think most ML jobs are for finding/cleaning data anyway. I suppose an "eBay of data" might be successful.


^this is a tactic that i see as viable in all the hype-cycle spaces


Aggregate some kind of data during your normal business operations. Then produce insight from that data.

Collecting and selling the data is a strategy.

Because VCs believe it gives them defensibility.

Strongly agree!

> Why is it that a lot of startups seem to be betting the farm, so to speak, on "machine learning" as their core value?

Because it is the current overhyped thing, and there are enough dumb investors to fund it.
