
Ask HN: Who's buying all the big data? - nightcracker
With every service, platform, software package, and god knows what else collecting data on you, I can't help but ask: who's buying all this data?<p>The answer is clear for massive players such as Facebook and Google, which don't have to sell their data. They can keep it and offer targeted marketing directly to advertisers.<p>But what about smaller players that don't have a strong 'big data' presence and product? Examples such as:<p><pre><code>  - online-only NVIDIA drivers
  - anti-virus software  
  - chat apps such as Skype, Discord, WhatsApp, etc  
  - online sites such as Github, StackExchange, etc  
  - &lt;shady mobile app #1402&gt;
</code></pre>
All of the above collect data about you in some form or another; it's a staple nowadays. Who's buying all this data? For how much?
======
dsacco
Mostly hedge funds and market research companies. Less frequently, companies
looking to bootstrap computer vision or natural language processing systems
without sourcing their own data.

If you productize the data as a forecast, it can cost anywhere from mid four
figures to low six figures per quarter per client, depending on the data and its
signal. Raw data is not particularly lucrative or useful, but developing a
predictive forecast with a very low margin of error based on an exclusive
source of data is both.

Smaller companies who find themselves with marketable data tend to partner
with well-networked financial research companies such as 7Park. Sometimes they
sell directly to clients if they have the network for it. In my experience,
sophisticated quantitative hedge funds mostly internalize their data sourcing
initiatives instead of purchasing from external vendors. Off the top of my
head, Renaissance Technologies and Two Sigma both run massive internal data
sourcing and analysis operations. They very rarely engage with outside data
vendors (particularly the former).

All of this falls under what is usually termed "alternative data." There are
basically two classes of data vendors. First, you have Foursquare, Meraki
(prior to its acquisition), Airmail, etc. Basically, any company which
provides a free service for things like geolocation, inbox decluttering or
financial account/statement aggregation is most likely reselling your data.
For example, Yodlee resells customer receipt data. These companies source
their data from a public facing product with a sufficiently strong moat to
make the data exclusive.

The second class of vendor curates data from a wide variety of sources and has
no public product from which to source their own data. For example,
SecondMeasure provides sophisticated analysis of the kind of receipt data that
Yodlee resells. YipitData is very active in this work (though they're not actually
very good at it). Most of the very good companies doing this work are pretty
under the radar - they tend not to be VC-funded or well-publicized. This class
is more easily replicated as an internal initiative in a hedge fund, because
all of the data is technically public, just very hard to find and analyze.

~~~
WingH
Why is YipitData not very good?

~~~
dsacco
There are several reasons, in my opinion:

1. They are very "loud" - VC funding, visibility and large headcount are
antagonistic features for a firm specializing in boutique data mining. It's
not impossible, it just makes it harder to maintain a competitive edge in e.g.
how long your data remains exclusive. On the other hand, these qualities are
excellent for developing a moat around a product from which you can source
data. Their website gives away quite a lot of information.

2. I have personally heard from analysts who mentioned that various forecasts
of theirs were rushed, superfluous or simply incorrect.

3. I have personally "beaten" their team in finding, curating and analyzing
data to produce a forecast which was in high demand by various clients.

It's not uncommon for funds to buy redundant forecasts from several vendors,
so research firms tend to be at least peripherally aware of one another. Keep
in mind that the more exclusive a source of data is, the more lucrative a
forecast will be if it has a high signal. The two extremes are data which
essentially only you are in possession of, and data which has no signal due to
widespread diffusion in the market. Eventually your forecast slides from the
first sort to the second sort, and letting other companies know what kind of
data you have is inviting them to compete.

~~~
WingH
Care to share some examples of data you've found that was used to produce
financial forecasts?

------
muzani
I did a grocery/food app. Grocery stores buy massive amounts of product data.
They need to know what to put on the shelves, what to put on what shelf, what
not to put on the shelf, which products are trending. Since they make very low
margins, this is one of the main differentiators in income. A single store
also has cash flow in the millions, so they're willing to spend a lot.

Another major use is building a wall around the product's users. The data
helps to recognize user patterns, build a better product, and keep users from
going to similar products, e.g. Netflix.

I think a lot of people look at "big data" as some kind of gold mine, but
really it's just a rock mine. You have to know which rocks are in demand by
whom before you start mining all of it.

~~~
matt_the_bass
> I think a lot of people look at "big data" as some kind of gold mine, but
> really it's just a rock mine. You have to know which rocks are in demand by
> whom before you start mining all of it.

I think this hits the nail on the head. Knowing how to extract value is what
matters, not just having the raw data.

------
trekking101
One person's metadata is another's data! You (maybe unintentionally)
characterise it as a market where you check boxes on "datasets" to buy, but
it's far from that. Monetizing by licensing/selling some kind of log data
isn't really interesting--but understanding what signal can be derived and how
it can be applied to an opportunity (arbitrage, value-added analysis,
financial product/insurance, etc.) is where it gets interesting.

I've written a bit about derivative uses of data if you are interested:
[https://post-employment.com/category/business/](https://post-employment.com/category/business/)

