Google (and other large tech companies that have invested in AI internally) are probably the most valuable investors, because they have access, and can purchase, the largest and best data sets available. It's hard for AI start-ups to do anything without access to the right data sets, and large companies have better access to that data.
As a startup, you have less leverage to access proprietary datasets. In fact, most of the companies we tried to work with wouldn't even let us access the data, let alone use it, unless we did everything on-prem. Oh, and one company even wanted to own half of our IP.
Your algos, pipelines, UI and UX might be the best, but without the data they aren't worth much on their own. If the dataset is really that good, then giving up a 50% stake (not necessarily the IP) might very well be worth it. It just comes down to the terms of the deal.
There's almost no scenario I can think of where giving up 50% of your IP is worth it. There would be almost no way to raise VC funds to hire more researchers or devs.
I've been meaning to create a marketplace for data to solve this problem. The obvious issue is how to protect sensitive information (perhaps instead of selling cleaned data, the platform trains models for you on leased data). I think the problem will grow quickly as meaningful data and the tech that it enables become an ever-growing barrier to entry against startups and slow-moving enterprises alike.
> they have access, and can purchase, the largest and best data sets available
Google might have an advantage in personal data that can be used for advertising and health, but general data, such as image datasets and NLP datasets, can be found in the public domain, and those datasets are growing fast. Google's advantage in datasets is specific and limited. Mostly for ads.
The best data set will in general only be as good as the raw data that was used to prepare it.
I think you underestimate just how far along Google is with respect to the huge amounts of raw data they handle. They've been around for 20 years now and amassed a lot of expertise handling all kinds of data imaginable at scale.
If you disagree, who would you say is ahead of Google wrt general data sets that are valuable?
That's true, but Google can also afford to acquire and monopolize data that other companies are sitting on but don't have the resources or talent to utilize internally.
This is exactly why this kind of stuff should be prohibited, and the datasets should be legally regulated.
This is not just an issue for startups, but also for independent researchers at universities, who often can’t even replicate the successes Google and co report.
Most studies currently done in AI by Google, Amazon, etc. have never been replicated, and likely never will be, because access to the data is missing.