Hacker News
Ask HN: Who's buying all the big data?
15 points by nightcracker 4 months ago | 7 comments
With every service, platform, software package, god knows what collecting data on you, I can't help but ask: who's buying all this data?

It's clear that massive players such as Facebook and Google don't have to sell their data. They can just keep it and offer targeted marketing directly to advertisers.

But what about smaller players, that don't have a strong 'big data' presence and product? Examples such as:

  - online-only NVIDIA drivers  
  - anti-virus software  
  - chat apps such as Skype, Discord, WhatsApp, etc  
  - online sites such as Github, StackExchange, etc  
  - <shady mobile app #1402>

All of the above collect data about you in some form or another; it's a staple nowadays. Who's buying all this data? For how much?

Mostly hedge funds and market research companies. Less frequently, companies looking to bootstrap computer vision or natural language processing systems without sourcing their own data.

If you productize the data as a forecast, it can cost anywhere from mid four figures to low six figures per quarter per client, depending on the data and its signal. Raw data is not particularly lucrative or useful, but a predictive forecast with a very low margin of error, built on an exclusive source of data, is both.

Smaller companies that find themselves with marketable data tend to partner with well-networked financial research companies such as 7Park. Sometimes they sell directly to clients if they have the network for it. In my experience, sophisticated quantitative hedge funds mostly internalize their data sourcing initiatives instead of purchasing from external vendors. Off the top of my head, Renaissance Technologies and Two Sigma both have massive internal data sourcing and analysis operations. They very rarely engage with outside data vendors (particularly the former).

All of this falls under what is usually termed "alternative data." There are basically two classes of data vendors. First, you have Foursquare, Meraki (prior to its acquisition), Airmail, etc. Basically, any company which provides a free service for things like geolocation, inbox decluttering or financial account/statement aggregation is most likely reselling your data. For example, Yodlee resells customer receipt data. These companies source their data from a public-facing product with a sufficiently strong moat to make the data exclusive.

The second class of vendor curates data from a wide variety of sources and has no public product from which to source its own data. For example, SecondMeasure provides sophisticated analysis of the kind of receipt data that Yodlee resells. YipitData is very active in this work (though they're not actually very good at it). Most of the very good companies doing this work are pretty under the radar - they tend not to be VC-funded or well-publicized. This class is more easily replicated as an internal initiative in a hedge fund, because all of the data is technically public, just very hard to find and analyze.

Why is YipitData not very good?

There are several reasons, in my opinion:

1. They are very "loud" - VC funding, visibility and large headcount are antagonistic features for a firm specializing in boutique data mining. It's not impossible, it just makes it harder to maintain a competitive edge in e.g. how long your data remains exclusive. On the other hand, these qualities are excellent for developing a moat around a product from which you can source data. Their website gives away quite a lot of information.

2. I have personally heard from analysts who said that various YipitData forecasts were rushed, superfluous or simply incorrect.

3. I have personally "beaten" their team in finding, curating and analyzing data to produce a forecast which was in high demand by various clients.

It's not uncommon for funds to buy redundant forecasts from several vendors, so research firms tend to be at least peripherally aware of one another. Keep in mind that the more exclusive a source of data is, the more lucrative a forecast will be if it has a high signal. The two extremes are data which essentially only you are in possession of, and data which has no signal due to widespread diffusion in the market. Eventually your forecast slides from the first sort to the second sort, and letting other companies know what kind of data you have is inviting them to compete.

Care to share some examples of data you've found that was used to produce financial forecasts?

I did a grocery/food app. Grocery stores buy massive amounts of product data. They need to know what to put on the shelves, what to put on what shelf, what not to put on the shelf, which products are trending. Since they make very low margins, this is one of the main differentiators in income. A single store also has cash flow in the millions, so they're willing to spend a lot.

Another major use is that the data builds a wall around the product's users. It helps to recognize user patterns, build a better product, and keep users from going to similar products. Netflix is one example.

I think a lot of people look at "big data" as some kind of gold mine, but really it's just a rock mine. You have to know which rocks are in demand by whom before you start mining all of it.

> I think a lot of people look at "big data" as some kind of gold mine, but really it's just a rock mine. You have to know which rocks are in demand by whom before you start mining all of it.

I think this hits the nail on the head. Knowing how to extract value is what matters, not just having the raw data.

One person's metadata is another's data! You (maybe unintentionally) characterise it as a market where you check boxes on "datasets" to buy, but it's far from that. Monetizing by licensing/selling some kind of log data isn't really interesting, but understanding what signal can be derived and how it can be applied to an opportunity (arbitrage, value-added analysis, financial products/insurance, etc.) is where it gets interesting.

I've written a bit about derivative uses of data if you are interested: https://post-employment.com/category/business/
