
The Empty Promise of Data Moats - prostoalex
https://a16z.com/2019/05/09/data-network-effects-moats/
======
jandrewrogers
Most of the leverage is in the _efficiency_ at exploiting data, and not just
in technical terms but also operational economics. This is far more important
than the data itself for a fundamental reason that has not been broadly
internalized by the industry at large, and which makes it difficult to build a
data moat.

Almost any proprietary data model based on proprietary data can be
reconstructed with sufficient fidelity from unrelated external data sources to
be competitive with that proprietary data model, at the cost of being somewhat
more expensive to engineer, all things being equal. I've demonstrated this many
times in practice. As a corollary, a company with sufficiently efficient end-
to-end data infrastructure could, in principle, commoditize all proprietary
data model companies. This company does not exist yet, and it would require some very specialized talent, but every necessary ingredient exists.

This is a realizable endgame due to the reality that virtually all data
companies, for good and practical reasons, are strongly incentivized to build
on infrastructures that are literally orders of magnitude less efficient than
is possible in principle. A company that was purpose-built on exceptional end-
to-end data infrastructure engineering could capture much of the revenue in
these markets in a surprisingly short time by commoditizing the data model and
arbitraging efficiency.

~~~
untilHellbanned
I was with you at the beginning. Can you ELI5?

~~~
jandrewrogers
I can give it a try. :)

There is an assumption that unique data enables unique insights, which can presumably be monetized in some fashion. As long as no one else has access
to your unique data, you have pricing power for the unique insights. There are
_loads_ of companies, big and small, trying to execute this business model.

A problem with this model is that for practical purposes there are no such
things as "unique insights". The only thing unique data grants you is a cheap
path to a specific set of insights. For every set of "unique" insights, there
are almost always many sets of unrelated data sources that can be analytically
combined to deliver the same insights. In the slightly seedy underworld of
data model brokers, I've seen some very impressive examples of this. The way
these alternative data models make money, despite the creation process being
more expensive, is that they are positioned as a cheaper alternative to
companies that think "unique data" means they can extract monopoly rent. The
explosion in data source availability has slowly made these clever
alternatives more prevalent.

Currently, the alternative to having access to the unique data is to do
significantly more expensive computations on what are typically larger and
more diverse data sets. This keeps it from becoming a runaway race to the
bottom due to the higher cost. Note that in both cases, the parties are using
conventional data infrastructure stacks with their implied limitations.

In recent years I've done extensive studies of the cost structures of these
types of businesses. It turns out that if you can reduce the end-to-end data
infrastructure costs by an order of magnitude for the reconstructive approach
then your total costs will be _far_ below the break even point for the
conventional "unique data" approach. Furthermore, the computational work
required to replicate one high-value data model is substantially reusable for
other data models, so the more data models you reconstruct, the lower the
marginal cost of reconstructing additional data models this way. Done at scale
across enough data models, the amortized cost of reconstructing unique data
can be less than using the unique data! People have been idly thinking about
what it would take to do this for a few years. Collecting the rare skillsets
required to pull it off is a major hurdle for any company.
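
To make the amortization concrete, here is a toy calculation; every number below is invented and only the shape of the curve matters:

    # Invented figures, purely to show the shape of the argument.
    fixed_infra = 5_000_000    # one-time cost of the efficient end-to-end infrastructure
    first_model = 2_000_000    # reconstructing the first data model from external sources
    reuse = 0.5                # each further model reuses roughly half the prior work
    unique_price = 1_500_000   # effective per-model cost of the "unique data" approach

    total = fixed_infra
    for n in range(1, 11):
        marginal = first_model * reuse ** (n - 1)
        total += marginal
        print(f"model {n:2d}: marginal ${marginal:,.0f}, amortized ${total / n:,.0f}")

With these made-up inputs the amortized cost per model dips below the $1.5M "unique data" figure around the sixth model; the crossover point obviously depends entirely on the numbers you plug in.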

The notion that it is possible to reduce end-to-end data infrastructure costs
by (at least) an order of magnitude relative to conventional data
infrastructures is well-supported but it raises another question: why can't
the "unique data" companies do the advanced engineering required to have such
an infrastructure themselves (it isn't something you can do with open source
currently)? The simplest answer is that it is difficult to justify extremely
expensive engineering efforts outside of an organization's core expertise
solely to prevent erosion of the market value of their unique data.
Fundamentally, it is a shift to competing on infrastructure instead of data,
which is an improbable transition for companies.

~~~
ntoshev
> For every set of "unique" insights, there are almost always many sets of
> unrelated data sources that can be analytically combined to deliver the same
> insights. In the slightly seedy underworld of data model brokers, I've seen
> some very impressive examples of this.

If you could share a couple of examples it would help a lot to get your point
across.

FWIW, I agree with the thesis that data intensive computation could be one or
two orders of magnitude more efficient than it currently is, with sufficient
engineering. Probably Cassandra vs ScyllaDB is a good public example, and
ScyllaDB is likely not close to the theoretical optimum at all. But I'm not
sure about deriving data from alternative sources. How do you derive movement
data for everyone with an Android phone if you're not Google?

------
PaulHoule
The best analogy from "data is the new oil" is that a data breach or privacy
event is like the Exxon Valdez.

~~~
sdoering
I work daily with clients who expect a lot from data. Especially when they quote that sentence, I tend to ask them what their car uses as fuel. Do they really fill their cars with raw oil, or with a product that was refined and turned into usable gasoline?

Data might be the new oil (I doubt it). But you can't use it raw. You need to work with this raw material and turn it into an end product, depending on the use case.

------
charleyma
"Most discussions around data defensibility actually boil down to scale
effects, a dynamic that fits a looser definition of network effects in which
there is no direct interaction between nodes."

Good distinction between scale vs network effects, not every company with
scale has a network effect...

------
walterbell
_> Generating synthetic data is another approach to catch up with incumbents
housing large tracks of data. We know of a startup that produced synthetic
data to train their systems in the enterprise automation space; as a result, a
team with only a handful of engineers was able to bootstrap their minimum
viable corpus. That team ultimately beat two massive incumbents relying on
their existing data corpuses collected over decades at global scale, neither
of which was well-suited for the problem at hand._

Generated data is being reverse-engineered by machine learning? With both the
generation and ML written by the same team?

~~~
skybrian
Generated data can be used to increase variety as a way to avoid overfitting.
A simple example might be translating or rotating images.
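
A rough sketch of that kind of augmentation, assuming numpy and scipy are available (the angles and shifts here are arbitrary):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def augment(image, rng):
        """Return a randomly rotated and translated copy of a 2-D image."""
        angle = rng.uniform(-15, 15)            # small random rotation (degrees)
        dx, dy = rng.integers(-4, 5, size=2)    # small random shift (pixels)
        out = rotate(image, angle, reshape=False, mode="nearest")
        return shift(out, (dy, dx), mode="nearest")

    rng = np.random.default_rng(0)
    image = rng.random((28, 28))                # stand-in for a real training image
    extras = [augment(image, rng) for _ in range(8)]   # eight variants from one example

Each variant keeps the label of the original, so one labeled example turns into several.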

Not an expert, but I'm guessing it must be easier to do this than to improve
how the machine learning generalizes from less data.

~~~
hahajk
Usually when someone says “synthetic data” they mean CGI, not simply
transformations of existing data. Using synthetic data is fraught (and
presumptuous), as you are assuming you understand the problem domain 100% and
are also extremely good at reproducing it. There’s a chance the model is using
something specific to the CGI (and not the general reality) to produce its
results.

For winning a computer vision competition it’s probably ok but I’d be very
careful about using synthetic data for systems I cared about.

~~~
TeMPOraL
I thought "synthetic data" it's something that rarely shows in training image
recognition, and is more like randomly generated user data (name, surname,
etc.) or data generated from simulations of some processes?
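
E.g. something like this toy sketch is what I have in mind (standard library only, every name and number made up):

    import random
    from datetime import datetime, timedelta

    FIRST = ["Alice", "Bob", "Carol", "Dawid", "Ewa"]
    LAST = ["Chen", "Garcia", "Kowalski", "Nowak", "Smith"]

    def fake_user(rng):
        """One synthetic user record: plausible in shape, tied to no real person."""
        signup = datetime(2019, 1, 1) + timedelta(days=rng.randrange(120))
        return {
            "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
            "signup": signup.isoformat(),
            "sessions": rng.randint(0, 50),  # crude stand-in for a simulated usage process
        }

    rng = random.Random(42)
    corpus = [fake_user(rng) for _ in range(1000)]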

~~~
johnnycab
>I thought "synthetic data" it's something that rarely shows in training image
recognition

On the contrary, it is used to train models but it cannot adequately capture
the long tail of _weird_ events in the real world. Hence, it cannot be relied
upon, as alluded to by the parent commenter. With reference to using data
collected from a simulated environment vs. the real world, this subject was discussed at some length by Elon Musk and Andrej Karpathy at the Tesla Autonomy Day event a few weeks ago.

~~~
skybrian
I generally agree that there is no substitute for experience running in
production and more data is better - or at least should be, if you can figure
out how to take advantage of it.

The thing is, when it comes to weird events, historical data can't be relied
on either. The next weird thing may never have happened before.

Predicting the future is hard no matter what you do. Gathering more data and
learning more efficiently from what you have are both important. Training on
artificial challenges can also be useful.

------
motohagiography
I'm really glad a16z put this issue to rest because it's the kind of problem
that causes insane wheel spin at a certain kind of company. The "Data
Acquisition Cost", and "Incremental data value," made me laugh because I had
to solve that problem before.

The most interesting related work on this is a paper covered on HN at some point (http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/) about economics' indifference curves and ML's ROC curves.

The big mistake I've seen data companies make is that they approach their market based on customer vertical, assuming that, say, a bank will be like other banks and a health care provider will be like other health care providers, and this
is the fatal error. The bank with the same sensitivity to fp/fn/tp/tn rates
will have more in common with the health care company with that same
sensitivity than it will with other banks.

The basic problem with any data product is the customer's ROC curve, or where they economically benefit from using your data service. Different customers have different sensitivities to false positive/negative and true positive/negative rates, and the customer categories themselves are defined by this sensitivity. E.g., what they have in common is not their vertical but their
risk appetite. I have a blog post 3/4 written on this specific topic.
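
As a toy illustration (every number below is invented), the same ROC curve yields different "best" operating points once you plug in each customer's fp/fn economics:

    # Hypothetical ROC points and payoffs, invented for illustration.
    roc_points = [(0.05, 0.40), (0.10, 0.60), (0.20, 0.75), (0.40, 0.90)]  # (fpr, tpr)

    def value_per_item(fpr, tpr, prevalence, gain_tp, cost_fp, cost_fn):
        p = prevalence
        return (tpr * p * gain_tp           # caught positives
                - fpr * (1 - p) * cost_fp   # false alarms
                - (1 - tpr) * p * cost_fn)  # missed positives

    customers = {
        "fraud-like (misses hurt)":
            dict(prevalence=0.02, gain_tp=500, cost_fp=5, cost_fn=800),
        "marketing-like (false alarms hurt)":
            dict(prevalence=0.10, gain_tp=30, cost_fp=10, cost_fn=2),
    }

    for name, econ in customers.items():
        best = max(roc_points, key=lambda pt: value_per_item(*pt, **econ))
        print(name, "-> best (fpr, tpr):", best)

With these payoffs the fraud-like customer wants the aggressive end of the curve and the marketing-like customer wants the conservative end - same curve, different economics, and that difference is what defines the customer category.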

That sensitivity is specifically an artifact of the customer's growth stage as
a company, which determines their risk appetite and the economics of the
asymmetrical value that the effective ROC curve of your data product
describes. (see above link).

This is the fundamental problem for an ML/AI company: they will go bankrupt trying to find their 2nd or 3rd insurance company customer because they assume the value of their product lies in the next customer being in the same vertical - not in having the same sensitivity to fp/fn/tp/tn.

Slight aside, it's so important that an investor like this can weigh in on these issues and other technical economic factors, because IMO, when I listen to every technical person I know, the #1 cause of internal suffering at companies is people trying to bullshit their investors, and blog posts like these just wipe away a big source of that temptation.

------
naveen99
The good thing is there are infinite functions, not just infinite data. So
even if you limit yourself to finite data, you can do interesting things.

------
evrydayhustling
Amen to this. Enterprises are aware their data is valuable and increasingly
have top-of-the-line devops and pipeline tools to manage it. The pendulum is
swinging towards data living with vertically integrated brands rather than
horizontal services.

------
sandGorgon
How does this reconcile with stuff like this? (https://factordaily.com/indian-data-labellers-powering-the-global-ai-race/)

Is this very contextual to the business space (like enterprise startups) which a16z clearly mentions?

------
naveen99
The nice thing is that computing power is finally outpacing the growth in human population. An enterprise data startup can actually collect, store, and process some finite amount of data on all 7.5 billion people on the planet. Just find an interesting angle and be better at processing that data than your competitors.
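
Back of the envelope, with made-up per-person sizes:

    people = 7.5e9
    for per_person, label in [(1_000, "1 KB"), (1_000_000, "1 MB")]:
        total_tb = people * per_person / 1e12
        print(f"{label} per person -> {total_tb:g} TB total")

At 1 KB per person that's ~7.5 TB, which fits on a single commodity drive; at 1 MB per person it's ~7.5 PB, a modest cluster. The hard part is the interesting angle, not the raw capacity.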

~~~
maxxxxx
I would call this very scary. Soon a single entity can do full surveillance of
the whole planet at reasonable cost. What could go wrong?

~~~
johnsimer
Yes, but we should also acknowledge the upside. Soon (or now) a single entity
can benefit/innovate for the entire planet at a reasonable cost. What could go
right?

~~~
wwweston
As the saying goes, power makes people benevolent, absolute power makes people absolutely benevolent, right?

