

Deconstructing Crunchbase: Understanding Public Self-Reported Data - iampims
http://jdmaturen.github.io/2013/deconstructing-crunchbase.html

======
danso
Crunchbase is a great data source, but it has considerable issues that the OP
doesn't cover. Beyond self-reporting bias, there are outright mistakes in the
data (including typos), miscategorizations, and an inherently limited
structure.

What follows may read as far too harsh a criticism of Crunchbase's
maintainers. None of these problems are TC's fault; they come from trying to
build a service that covers such a massive universe. This is not "Oh, what
terrible data designers TC has" but "Hey, CB is a great service, but you
should know these limitations before doing an analysis."

(I last scoured the data in October, so I apologize if I bring up anything
that has already been fixed.)

One of the overarching problems is that Crunchbase is aimed at startups, with
a focus on investment rounds, and yet there's a considerable number of entries
for well-established companies that have no relevant data. I suppose it's easy
enough to filter the complete database with an inner join (return all
companies for which there's at least one investment round).
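
A minimal sketch of that filter, run against an in-memory SQLite database from Python. The table and column names (`companies`, `funding_rounds`, `company_id`) and the sample rows are made up for illustration; they are not Crunchbase's actual schema or data.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE funding_rounds (
        id INTEGER PRIMARY KEY,
        company_id INTEGER REFERENCES companies(id),
        raised_usd INTEGER
    );
    INSERT INTO companies VALUES (1, 'ExampleStartup'), (2, 'EstablishedCorp');
    INSERT INTO funding_rounds VALUES (1, 1, 500000), (2, 1, 2000000);
""")

# The inner join drops companies with no funding rounds;
# DISTINCT collapses companies that have several rounds into one row.
rows = conn.execute("""
    SELECT DISTINCT c.id, c.name
    FROM companies AS c
    INNER JOIN funding_rounds AS r ON r.company_id = c.id
    ORDER BY c.id
""").fetchall()

funded_companies = [name for _id, name in rows]
print(funded_companies)
```

Here `EstablishedCorp` has no rows in `funding_rounds`, so it falls out of the result.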

Related to that: without itemized valuation rounds, the company valuation
field (a column in the companies listing) is not very helpful, because it
depends on users actively updating it with each quarterly financial report or
what have you. That's clearly not going to happen, so perhaps the valuation
field should be moved to an events table where users can list the company's
valuation on a per-quarter or per-year basis.
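
One way such an events table could look, as a rough sketch. The `valuation_events` table, its columns, and the sample figures are all hypothetical; the idea is only that each valuation is a dated row rather than a single overwritten field.

```python
import sqlite3

# Hypothetical schema and figures, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE valuation_events (
        company_id INTEGER REFERENCES companies(id),
        as_of TEXT,            -- ISO date the valuation applies to
        valuation_usd INTEGER
    );
    INSERT INTO companies VALUES (1, 'ExampleStartup');
    INSERT INTO valuation_events VALUES
        (1, '2012-03-31', 10000000),
        (1, '2012-09-30', 25000000);
""")

def latest_valuation(conn, company_id):
    """Return (as_of, valuation_usd) for the most recent event, or None."""
    return conn.execute(
        "SELECT as_of, valuation_usd FROM valuation_events "
        "WHERE company_id = ? ORDER BY as_of DESC LIMIT 1",
        (company_id,),
    ).fetchone()

latest = latest_valuation(conn, 1)
print(latest)
```

ISO-formatted dates sort correctly as text, so the "current" valuation is just the newest row, and the older rows stay available as history.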

And on the topic of inflexible structure: the category field is woefully
limited. Small single-focus companies aren't consistently categorized, and
when you get to the big multifaceted companies like Google and Microsoft, how
do you categorize them with a single word? "Internet"? "Software"?
"Enterprise"? Tags would work much better here.
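
A sketch of what tagging could look like as a many-to-many table instead of a single category column. The `company_tags` table and the placeholder companies are assumptions for illustration, not anything CB provides.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE company_tags (
        company_id INTEGER REFERENCES companies(id),
        tag TEXT,
        PRIMARY KEY (company_id, tag)   -- a company gets each tag at most once
    );
    INSERT INTO companies VALUES (1, 'BigMultifacetedCo'), (2, 'SmallFocusedCo');
    INSERT INTO company_tags VALUES
        (1, 'internet'), (1, 'software'), (1, 'enterprise'),
        (2, 'software');
""")

def companies_tagged(conn, tag):
    """Return the names of all companies carrying the given tag."""
    rows = conn.execute(
        "SELECT c.name FROM companies AS c "
        "JOIN company_tags AS t ON t.company_id = c.id "
        "WHERE t.tag = ? ORDER BY c.name",
        (tag,),
    ).fetchall()
    return [name for (name,) in rows]

software = companies_tagged(conn, "software")
enterprise = companies_tagged(conn, "enterprise")
print(software, enterprise)
```

A multifaceted company simply carries several tags, so it shows up under each of them rather than being forced into one bucket.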

So the above problems are difficult to solve on both an internal level and for
your submitters...but here's an example that made me give up (for now) on
doing a thorough analysis of CB data:

Color: [http://www.crunchbase.com/company/color-labs](http://www.crunchbase.com/company/color-labs)

Color is listed as "closing". Yet, AFAIK, it was "acquired"...even
TechCrunch's own reporting says so:

[http://techcrunch.com/2012/11/19/sources-apple-paid-7-million-for-color-labs/](http://techcrunch.com/2012/11/19/sources-apple-paid-7-million-for-color-labs/)

I'm not arguing about whether Color's acquisition is a "fail" on the same
level as shutting down. The point is that the startup world does distinguish
between the two, and CB needs to formalize that distinction, because it is
among the most important datapoints that a startup DB can provide. The
ambiguity/error in Color's listing is worth pointing out because it was a
highly watched, highly mocked, and highly discussed company. That its CB
status wasn't noticed or fixed is a sort of bellwether for what may be the
case for many of the other companies in the DB.

For a fun data-digging project, query for recently started companies with
large investment rounds from firms that seem to be one-offs. Either they are
very interesting companies (i.e. under-the-radar companies that have attracted
large sums of money), or it's just made-up data.
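
Roughly, such a query might look like the sketch below. Everything in it is an assumption for illustration: the `investments` table, the column names, the sample rows, and the two thresholds. "One-off" is taken to mean an investor that appears in exactly one investment row.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        id INTEGER PRIMARY KEY, name TEXT, founded_year INTEGER
    );
    CREATE TABLE investments (
        company_id INTEGER REFERENCES companies(id),
        investor TEXT,
        raised_usd INTEGER
    );
    INSERT INTO companies VALUES
        (1, 'QuietCo', 2012), (2, 'KnownCo', 2012), (3, 'OldCo', 2005);
    INSERT INTO investments VALUES
        (1, 'OneOff Capital', 30000000),
        (2, 'Busy Ventures', 40000000),
        (3, 'Busy Ventures', 50000000),
        (3, 'Solo Fund', 60000000);
""")

MIN_FOUNDED_YEAR = 2011      # arbitrary cutoff for "recently started"
MIN_RAISED_USD = 20000000    # arbitrary cutoff for "large round"

# Keep large investments in recent companies where the investor
# appears exactly once in the whole investments table.
suspects = conn.execute("""
    SELECT c.name, i.investor, i.raised_usd
    FROM investments AS i
    JOIN companies AS c ON c.id = i.company_id
    WHERE c.founded_year >= ?
      AND i.raised_usd >= ?
      AND i.investor IN (
          SELECT investor FROM investments
          GROUP BY investor HAVING COUNT(*) = 1
      )
    ORDER BY i.raised_usd DESC
""", (MIN_FOUNDED_YEAR, MIN_RAISED_USD)).fetchall()

print(suspects)
```

The output is only a list of rows worth a manual look; telling a genuine under-the-radar company from made-up data still takes a human.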

OK, so what's great about CB data, besides its ambition and its generous terms
of service? Analysis based on investment rounds. I think it's safe to say that
it's in startups' best interest to tell the world how much big firms have
invested in them so far. And the structure of the investments table forces the
inputter to provide useful data, at least compared to the other tables.

To reiterate, this is not a criticism of CB's efforts, just a rundown of the
natural limitations that exist so far. Would love to do a hackathon...or
rather, a "cleanathon" to make the data more uniform.

