Google (and other large tech companies that have invested in AI internally) are ...

ganeshkrishnan · on July 12, 2017

> It's hard for AI start-ups to do anything without access to the right data sets,

Exactly. As a founder of an AI-focused startup, I find it so hard to convince VCs that data can be as valuable as revenue.

Some VCs don't even bother going beyond the screening call if there is little revenue, even though the data we gather in the process is more valuable.

Google knows the true value of data very well, and hopefully they can shake up the VC world for AI.

deepGem · on July 12, 2017

As a startup you have less leverage to access proprietary datasets. In fact, most of the companies we tried to work with wouldn't even let us access the data, let alone use it, unless we did everything on-prem. Oh, and one company even wanted to own half of our IP.

_e · on July 12, 2017

Your algos, pipelines, UI and UX might be the best, but you still don't have the data to make it worthwhile on its own. If the dataset is really that good, then giving up a 50% stake (not necessarily the IP) might very well be worth it. It just comes down to the terms of the deal.

forgotmysn · on July 12, 2017

There's almost no scenario I can think of where giving up 50% of your IP is worth it. There would be almost no way to raise VC funds to hire more researchers or devs.

jameslk · on July 12, 2017

I've been meaning to create a marketplace for data to solve this problem. The obvious issue is how to protect sensitive information (perhaps instead of selling cleaned data, the platform trains models for you on leased data). I think the problem will grow quickly as meaningful data and the tech that it enables become an ever-growing barrier to entry for startups and slow-moving enterprises alike.

astrojams · on July 12, 2017

Check out data.world

visarga · on July 12, 2017

> they have access, and can purchase, the largest and best data sets available

Google might have an advantage in personal data that can be used for advertising and health, but when it comes to general data, such as image datasets and NLP datasets, those can be found in the public domain and are growing fast. Google has only a specific, limited advantage in datasets, mostly for ads.

nl · on July 12, 2017

The largest, most interesting recent public datasets in image and NLP were released by Google.

For example, here are some of their recent NLP datasets: https://github.com/google-research-datasets

In images, OpenImages is theirs, and there are assorted ones derived from YouTube.

Stanford's SNLI is the most recent non-Google NLP dataset that is getting used a lot. bAbI (from FB) too, if you count that as NLP.

sah2ed · on July 12, 2017

The best data set will in general only be as good as the raw data that was used to prepare it.

I think you underestimate just how far along Google is with respect to the huge amounts of raw data they handle. They've been around for 20 years now and amassed a lot of expertise handling all kinds of data imaginable at scale.

If you disagree, who would you say is ahead of Google wrt general data sets that are valuable?

forgotmysn · on July 12, 2017

That's true, but Google can also afford to acquire and monopolize data that other companies are sitting on but don't have the resources or talent to utilize internally.

kuschku · on July 12, 2017

This is exactly why this kind of stuff should be prohibited, and the datasets should be legally regulated.

This is not just an issue for startups, but also for independent researchers at universities, who often can’t even replicate the successes Google and co report.

Most studies currently done in AI by Google, Amazon, etc. have never been replicated, and likely never will be, because access to the data is missing.

phreeza · on July 12, 2017

Can you give some examples of such studies?