

$199 Brings Google Knowledge Graph in Reach of Hackers and Startups - PaulHoule
http://basekb.com/?a=hn1

======
tonfa
Freebase isn't the same thing as the Google Knowledge Graph. Freebase contains
22M entities (according to the Wikipedia entry), while the Knowledge Graph
contains 500M entities (according to Google's announcement).

~~~
PaulHoule
Actually, the 500M (or 400M, or 300M, or whatever) number is bogus.

We do know that they ported "graphd", the Freebase database engine, to Google's
infrastructure. Essentially nothing happened at Freebase for a year and a half
while they were doing this.

They've probably created a very large Freebase graph for scale testing and
it's likely that they're able to handle 500M entities from a hardware
perspective.

If you look at how often the GKG actually comes up in search, it clearly
doesn't come up very often. So the process that makes the visible GKG is one of
subtraction, not addition. There are many topics where the GKG could give you
an answer from Freebase but doesn't; they prune many of them away because they
don't want to risk giving wild answers. It's more likely that 1 to 1.5 million
topics are in the visible GKG.

~~~
ThisIBereave
Yes, but what you're providing is _not_ the Google Knowledge Graph. You should
not claim that you are.

~~~
PaulHoule
It's not the same thing but it's the closest thing available.

I go around saying there's no ring to bind them all, but that an arbitrarily
good approximation can be built.

------
mark_l_watson
I used to use the Freebase RDF endpoint, so that part is nothing new.

That said, this looks very interesting, and I expect businesses to grow by
selling public data in easier-to-use forms.

~~~
PaulHoule
The Freebase RDF endpoint had three problems:

(i) it would only let you look at one subject at a time, (ii) it would only
return a maximum of 100 facts about a subject, and (iii) it didn't use names
consistently, so you couldn't match up properties and classes with the schema.

It's all great for a hack day demo but not acceptable at all if you care about
building systems that give the right answer.

------
chintan
Great stuff! On the landing page, I would add some examples of what one can do
with this KB.

e.g., a startup in the local space could use BaseKB to find various attributes
about a city, nearby towns, etc. You could also price various subsets geared to
such niches.

Most startups re-invent the wheel when it comes to such KB/data issues. Show
them the math: instead of spending 0.5 to 1 FTE's time aggregating/cleaning
data, they might as well shell out $199/mo and get the job done.
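
A rough back-of-the-envelope version of that math (sketch only; the salary
figure below is an assumption, not something from this thread):

    # Back-of-the-envelope: in-house data cleaning vs. a $199/mo KB.
    # The fully loaded engineer cost is an assumed figure.
    annual_fte_cost = 120_000   # assumed fully loaded cost per engineer
    fraction_of_time = 0.5      # 0.5 FTE spent aggregating/cleaning data

    diy_monthly = annual_fte_cost * fraction_of_time / 12
    print(f"DIY: ${diy_monthly:,.0f}/mo vs. BaseKB: $199/mo")
    # DIY: $5,000/mo vs. BaseKB: $199/mo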

------
xSwag
Hey, I would like some more information on exactly what type of data you have.

>just about anything that would come up in a game of "20 Questions"

I don't feel like that is clear enough.

~~~
PaulHoule
It's a machine-readable encyclopedia with very good coverage of concepts that
are in people's shared consciousness.

If it were combined with a large document corpus (say CommonCrawl), you could
use it to build (i) something like DMOZ that needs little human input, (ii) a
document analyzer that extracts named entities and topics, (iii) a knowledge
base to support "enterprise search", where the major problem is recall because
people don't use the right keywords, or (iv) many other intelligent apps which
I haven't thought of yet.
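
As a toy sketch of (ii), assuming you have already extracted a name-to-topic
dictionary from the KB (the labels and topic IDs below are made up for
illustration):

    # Toy gazetteer tagger: find KB entity names in a document.
    # The label-to-ID dictionary is a made-up stand-in for what you'd
    # extract from the KB's name/alias facts.
    labels = {
        "Ithaca": "/m/111",              # hypothetical topic IDs
        "Cornell University": "/m/222",
    }

    def tag_entities(text):
        """Return (surface form, topic id) pairs found in the text."""
        return [(name, mid) for name, mid in labels.items() if name in text]

    print(tag_entities("She studied at Cornell University in Ithaca."))
    # [('Ithaca', '/m/111'), ('Cornell University', '/m/222')]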

------
huggah
It should be noted that the RDF endpoint isn't the only way of getting data
out of Freebase.

Freebase provides complete weekly data dumps
(<http://wiki.freebase.com/wiki/Data_dumps>) and has instructions for post-
processing them; presumably this is exactly what PaulHoule is doing. As
chintan pointed out, it could still be a worthwhile service for you.

------
namidark
Seems like it might be a better model to charge per fact usage (similar to EC2
hourly usage, DB queries, etc.)?

~~~
PaulHoule
The economical (and successful) model for handling data of this sort is batch
processing, possibly with some tool like Hadoop -- and our market research
shows that many potential customers are already using this.

If we provided a live database to customers, we'd have to make at least 10
times the revenue to cover variable costs, and I'm not sure we could provide a
service with satisfactory performance at that level.

RDF technology is rapidly advancing, and right now there's no substitute for
running your own triple store on your own machine with a lot of RAM and a few
SSDs.
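
As a rough sketch of that workflow (the file name is assumed, and rdflib here
stands in for a serious disk-backed triple store like Virtuoso, which is what
you'd actually want for the full dataset):

    # Load a small N-Triples slice and query it locally with rdflib.
    from rdflib import Graph

    g = Graph()
    g.parse("slice.nt", format="nt")   # assumed local dump slice

    for row in g.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
        print("triples loaded:", row.n)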

------
vibrunazo
What exactly is the difference between using this and the Freebase API? Are
they just converting data formats?

~~~
PaulHoule
The difference between SPARQL 1.1 and MQL (the proprietary Freebase query
language) is like the difference between chess and checkers.

In SPARQL you can write queries that involve any graph relationships that come
into your head. You can take the UNION of multiple graph patterns. With SPARQL
you can also get back 200,000 or more results.
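
For example (the ex: predicates and file name below are hypothetical stand-ins
for illustration, not actual Freebase/BaseKB names), a single query that takes
the UNION of two graph patterns, run with Python's rdflib:

    # Sketch: a SPARQL 1.1 query that UNIONs two graph patterns.
    from rdflib import Graph

    g = Graph()
    g.parse("slice.nt", format="nt")

    q = """
        PREFIX ex: <http://example.org/>
        SELECT DISTINCT ?city WHERE {
            { ?person ex:bornIn ?city }
            UNION
            { ?person ex:diedIn ?city }
        }
    """
    for row in g.query(q):
        print(row.city)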

With MQL you quickly hit a wall: you can't write the query you want (you need
to write 10,000 queries instead of one), or the query you want to run times
out, or it only lets you get back a limited number of answers.

It's also possible to do batch processing with :BaseKB Pro with extreme
efficiency. With an optimized pipeline, we did a calculation on a MacBook Pro
in just 24 hours that would have taken 100 years in SPARQL on the biggest
machine in the Amazon cloud.

~~~
tcwc
How about the Freebase quad dump? Can't I just import that into Virtuoso and
run the same SPARQL queries?

~~~
itsnotlupus
You'll need to process that file into something that your DB can consume. If
all you need are minor syntax tweaks, you can stream straight through.

More likely, you'll need to rebuild the dump into something more than quads
ordered by arbitrarily assigned unique IDs, and with nearly 600 million quads
that's not completely trivial to do; it might involve temporary index files,
etc.

Still, it's doable.

(also see: <http://code.google.com/p/freebase-quad-rdfize/> )
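
For the "stream straight through" case, a minimal sketch (the
source/property/destination/value column layout and the namespace prefix are
assumptions to check against the dump documentation, and a real converter
needs proper datatype and escaping handling):

    # Stream a tab-separated quad dump into rough N-Triples.
    import sys

    BASE = "http://rdf.freebase.com/ns"   # assumed namespace prefix

    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue                      # skip malformed lines
        src, prop, dest, value = parts
        if dest:                          # object is another node
            obj = "<%s%s>" % (BASE, dest)
        else:                             # object is a literal value
            obj = '"%s"' % value.replace('"', '\\"')
        print("<%s%s> <%s%s> %s ." % (BASE, src, BASE, prop, obj))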

------
parsa28
How is this data different from what you could obtain by utilizing OpenLink
Virtuoso + Sponger (Freebase cartridge) + DBpedia Live?

~~~
PaulHoule
Overall accuracy is better.

DBpedia is produced by a Rube Goldberg machine that parses (undocumentable)
wiki markup.

If a fact is wrong in DBpedia, you need to do detective work to know whether
it's a problem with the mappings or with Wikipedia. If it's in Wikipedia, maybe
the editors will let you fix it on a good day.

Precision and recall for types like "Person" are much better in Freebase. Years
ago I spent a week and a half cleaning up data from DBpedia to do something I
was able to do in 15 minutes with Freebase.

------
billycravens
Can anyone direct me to the new URL for Hacker News? I missed the announcement
where this site was converted to Hacker Ads.

