Google's knowledge graph is much, much better than any of the open data competition because they have done the work to make it consistent (not in the ACID sense but in the completeness sense)
For example, Wikidata appears good on the surface, but as soon as you try to build against it you find huge holes in the data.
As a more specific example, the most common example you will see for Wikidata is "list the cities with a female mayor, in order of population." Great, except it turns out that many (most?) cities aren't marked up with the property that makes them count as cities for the purposes of that query.
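For the curious, here is a minimal sketch of that query against the public Wikidata SPARQL endpoint, runnable with Python and requests. The IDs used (P31 = instance of, Q515 = city, P6 = head of government, P21 = sex or gender, Q6581072 = female, P1082 = population) are the commonly documented ones, and they are exactly the assumptions that break when cities aren't typed as Q515:

    # Rough sketch of the canonical "cities with female mayors by population"
    # query against the public Wikidata SPARQL endpoint.
    import requests

    query = """
    SELECT ?cityLabel ?mayorLabel ?population WHERE {
      ?city wdt:P31 wd:Q515 ;          # instance of: city  <-- where the holes are
            wdt:P6 ?mayor ;            # head of government
            wdt:P1082 ?population .    # population
      ?mayor wdt:P21 wd:Q6581072 .     # sex or gender: female
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    ORDER BY DESC(?population)
    LIMIT 20
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["cityLabel"]["value"], row["population"]["value"])

Any city whose item is missing the wdt:P31 wd:Q515 triple simply disappears from the results, which is the hole described above.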
Knowledge APIs add typing to search. That's really important because it lets you disambiguate queries well (Apple the computer company vs Apple the fruit) and behave more intelligently based on that type.
Things like the DDG API (mentioned in this thread) don't do that. DBpedia/Wikidata/Yago do it, but so inconsistently that the benefits are hard to realize (you end up coding for all the different ways types are handled).
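To see what inconsistent typing looks like in practice, here's a quick sketch that lists every rdf:type DBpedia attaches to a single entity. The endpoint and resource URI are the public DBpedia defaults; treat the rest as an assumption. You typically get a long mixed bag of dbo:, yago:, schema.org and umbel classes, with no obvious canonical one:

    # Sketch only: dump every rdf:type attached to one DBpedia resource.
    import requests

    query = """
    SELECT DISTINCT ?type WHERE {
      <http://dbpedia.org/resource/Apple_Inc.> a ?type .
    }
    """
    resp = requests.get("http://dbpedia.org/sparql",
                        params={"query": query,
                                "format": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["type"]["value"])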
Wikidata has a different license, and the migration to it is just a PR piece that means little. How many facts have been imported since Jan 2015? Wikidata is at 15,473,837; Freebase (summer 2015) is at 3,146,939,673. Basically, Google's Freebase shutdown set AI research back at least two years (assuming Wikidata can catch up to 90% of Freebase's size by 2018). Now Google, Microsoft and IBM have a competitive advantage - each has its own closed knowledge base.
A stale, six-month-old Freebase dump gets more useless over time. Wikidata has a different license, and the migration is just a PR piece that means little: 15,473,837 (Wikidata) vs 3,146,939,673 (Freebase) - little has changed since Jan 2015.
> would it be correct to say the issue is Google isn't releasing their new data for free?
How about: Google shut down a knowledge base that was curated by a community and that provided regular data dumps, an online interface and an API - all under an open license (the original data source is nevertheless Wikipedia et al). Google's new venture is basically the same core technology and data, but the crawler also runs over scraped web content, and the only access for non-Googlers is via an API. Draw your own conclusions from that.
That 3,146,939,673 number is the number of statements (triples), not the number of resources (which is the Wikidata number). Wikidata has 900M statements, not 15M[1].
Again, the Google Knowledge Base is much more than an expanded Freebase. It uses Google's Knowledge Vault project to extract from sources outside Freebase, as well as to evaluate and update the Freebase resources. To quote:
> In particular, KV has 1.6B triples, of which 324M have a confidence of 0.7 or higher, and 271M have a confidence of 0.9 or higher. This is about 38 times more than the largest previous comparable system (DeepDive [32]), which has 7M confident facts (Ce Zhang, personal communication). To create a knowledge base of such size, we extract facts from a large variety of sources of Web data, including free text, HTML DOM trees, HTML Web tables, and human annotations of Web pages. (Note that about 1/3 of the 271M confident triples were not previously in Freebase, so we are extracting new knowledge not contained in the prior.)[2]
The second result in a query for "Rogan Josh" is "Jeremy Clarkson", a Person of "Top Gear" fame. Digging up the Freebase record doesn't show any obvious reason why this would happen.
So will Google release open data dumps? Google's Knowledge Graph is based on the data that was known as Freebase (see various papers). Google is about to shut down Freebase at the end of 2015. Freebase had been bought by Google, kept open, and had a great community. (Wikidata is still several orders of magnitude too small to be an alternative.)
And stale Freebase data from summer 2015 gets more useless every day.
Microsoft bought Powerset (for Bing and the Cortana AI), and IBM recently bought Blekko (for the Watson AI). Google closed Freebase and reuses it for the Knowledge Graph (Google Now AI and Search). That recent development hurts independent AI research and smaller AI companies.
Well, well, well. I know what I'll be playing with over the holidays. Some get Legos®, some get knowledge graphs.
But seriously, I've been playing around with Wikidata's Query Service[0]. Here's an example[1] that asks, "What is 'nature' a part of?" (Once you click through the URL shortener you can click Execute to run the SPARQL[2] query. SPARQL is a W3C recommendation, sort of like SQL but for triplestores, though I find its details not readily graspable.)
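Here's roughly the same question as code you can run directly against the Query Service endpoint. This assumes Q7860 is the Wikidata item for "nature" and P361 is the "part of" property - check both in the Wikidata UI before trusting the answer:

    # Sketch: "What is 'nature' a part of?" against the public endpoint.
    import requests

    query = """
    SELECT ?wholeLabel WHERE {
      wd:Q7860 wdt:P361 ?whole .   # nature -- part of --> ?whole
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """
    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["wholeLabel"]["value"])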
It seems like Google's Knowledge Graph is based on Wikidata? I used to think the Semantic Web was always going to be a decade away, but now I think it is going to play a large part in the near future of the web - though if you pushed me to explain my change in reasoning, I don't think I'd be able to. What we need are Semantic Web browsers; no idea what they'd look like though :(
Here is Wikidata's table of properties from which it builds up its entire knowledge graph[3]. I think it's fascinating.
Google's knowledge graph was in part based on Freebase, a large open knowledge base built by Metaweb. Google acquired Metaweb, continued to grow its triple extractors [1], and eventually shut off write access to the graph. Wikidata is slowly extracting information from the last public version of Freebase to grow out its own knowledge base.
There are tons of RDF sources out there if you'd like:
http://dbpedia.org/sparql is generally accepted as a better curated resource (both in quality and quantity)
http://data.linkedmdb.org/
http://www.rdfdata.org/
I think some subsets of the US federal government release their data structured this way too.
Most of Google's data is scraped from Wikipedia, so either/or is probably pretty similar. I assume Google ranks the results of the data better, mind you.
For a second I thought this actually gave access to the graph, so you could traverse it a la 'see the edges, visit the nodes', which would certainly have piqued my curiosity. This doesn't seem to offer much more than HN user/DuckDuckGo operator Gabriel's API[1]. I have a console app I've written that queries WolframAlpha and DDG. Between the two, > 90% of my 'fast questions' get answered (with the added bonus of a decent privacy policy).
I used the Knowledge Graph when I was a contractor at Google. Graph queries are not cheap, but entity lookups should be inexpensive.
I am very happy that Google opened up this API. I used to use Freebase, and I use DBpedia a lot. When I get home from traveling I am looking forward to kicking the tires of the new API.
I'm curious how the knowledge graph API performs disambiguation without any context. E.g., if you search for `Apple` will it return the company or the fruit?
The current service returns a ranked (with scores) list of up to 200 entities. You can specify a type in your query or filter the results to select types of interest (e.g., Person, Place, or Organization). The top result for 'apple' is the Corporation 'Apple, Inc.' and #2 is the Thing 'apple' (the fruit). The score is probably based on a graph popularity metric (e.g., number of inlinks), possibly augmented by PageRank. Interestingly, the Knowledge Graph ID is the same as the Freebase MID, and the results of the KG search for 'apple' appear to be a subset of a similar Freebase search, in the same order.
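For anyone who wants to try it, here's a minimal sketch of that entities:search call. API_KEY is a placeholder, and the response shape (itemListElement / result / resultScore) is taken from the public docs, so treat it as an assumption:

    # Sketch of a typed Knowledge Graph Search API lookup.
    import requests

    resp = requests.get("https://kgsearch.googleapis.com/v1/entities:search",
                        params={"query": "apple", "limit": 5, "key": "API_KEY",
                                "types": "Corporation"})  # omit to see the fruit at #2
    for element in resp.json().get("itemListElement", []):
        result = element["result"]
        # result["@id"] is "kg:" plus the old Freebase MID, per the comment above
        print(element["resultScore"], result["@id"],
              result["name"], result.get("@type"))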
I don't see how it could. If you search Google for apple, or even ask a person to give you information about the term "Apple", how can they give you what you need without further context?
The Knowledge "Graph"[1]offers an optional disambiguation parameter you can query with[2]. DuckDuckGo (I swear I'm not a shill or associated with them!) offers a disambiguation API out-of-the-box and integrates some of the RDF material I mentioned below. Here's your "Apple" example[3].
Though, given the amount of data Google has on the average user, and the fact that you have to sign up for an API key that is presumably associated with your search history and Gmail history (any conversations sent from your Gmail account, or any mail dispatched from a Gmail account directed at you), they could easily determine whether you meant Apple the fruit [you work for the USDA], Apple the company [you're an engineer in SF with a User-Agent history heavily skewed towards Safari], or the etymology of the word "apple" [you're a linguist], and disambiguate based on that aggregate information. I'd imagine it would be pretty trivial to do with their existing advertising profile plus the visit history of any site that has either Google Analytics or a DoubleClick ad.
[1] Again, I struggle to call it a graph. Even if it's implemented as a graph database on Google's end, until the end user can traverse it, it's just a Knowledge API.
Really cool, thanks! The `types` filter seems to be a good way to add context if you know, a priori, the type you're looking for.
Google definitely fuses user data into their knowledge graph. This is seen in Freebase's `g.` identifier [1]. I'm curious if they'd influence their publicly facing API algorithms using that data.
Completely agree! I'm curious whether the `query` parameter in the API performs well on long queries (with context) or whether it needs to be focused on a single entity's name.