
Diffbot launches AI-powered knowledge graph of 1T facts - borisjabes
https://venturebeat.com/2018/08/30/diffbot-launches-ai-powered-knowledge-graph-of-1-trillion-people-places-and-things/
======
d--b
It's incredible people still try and do things like this.

This is such an over-ambitious problem...

1\. Webpages are laden with errors, how do you deal with this?

2\. Knowledge does not fit in a graph. It's only approximately a graph, as in:
I can define relationships like: this recipe contains carrots; carrots contain
sugar => this recipe contains sugar. Cool. But what about this "sugar-free
carrot cake" recipe? Well, it still contains carrots, so it still contains
sugar... Contradiction? => requires human curating... (see the sketch after
this list)

3\. It doesn't even solve a real problem... Look at IBM Watson, it probably
knows a lot more crap than Diffbot, and yet it is a pretty useless piece of
software...
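
To make point 2 concrete, here's a minimal sketch with made-up facts showing
how naive transitive reasoning over "contains" edges walks straight into that
contradiction:

```python
# Hypothetical facts: naive transitive reasoning over "contains" edges
# reaches a conclusion that contradicts what the page itself claims.
facts = {
    ("sugar-free carrot cake", "contains", "carrot"),
    ("carrot", "contains", "sugar"),
    ("sugar-free carrot cake", "is", "sugar-free"),  # stated on the page
}

def contains(graph, item, ingredient):
    """Naive transitive closure over 'contains' edges."""
    direct = {o for (s, p, o) in graph if s == item and p == "contains"}
    if ingredient in direct:
        return True
    return any(contains(graph, d, ingredient) for d in direct)

print(contains(facts, "sugar-free carrot cake", "sugar"))  # True
# ...which contradicts the page's own "sugar-free" claim, so someone (or
# something smarter) has to decide which fact wins.
```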

~~~
moxious
Wanna thread other known attempts at this? Maybe someone will jump in with
extra detail about how these approaches are different, or what extra value
we'd expect.

\- OpenCyc: [http://www.cyc.com/opencyc/](http://www.cyc.com/opencyc/)

\- DBPedia: [https://wiki.dbpedia.org/](https://wiki.dbpedia.org/)

~~~
miket
Founder here. OpenCyc (as well as Freebase) is a human attempt to manually
enter and curate a structured knowledge base. Likewise, DBPedia is a set of
scripts that extract Wikipedia infoboxes (semi-structured data that is also
human crowd-sourced).

The Diffbot Knowledge Graph is built by applying computer vision and natural
language processing techniques to read all the pages on the web (which can be
in any structure and any human language) and extract their content into a
structured form, without any element of human annotation in the build
pipeline.
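
To give a flavor of what "extracting into a structured form" means -- this is
just a toy sketch using off-the-shelf open-source libraries, not our
production pipeline -- think of something like: strip a page down to text, run
entity recognition over it, and emit a structured record with no human in the
loop.

```python
# Toy illustration only (not Diffbot's pipeline): strip markup, run NER,
# emit a structured record. Assumes beautifulsoup4, spacy, and the
# en_core_web_sm model are installed.
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load("en_core_web_sm")

def extract_record(html, url):
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    doc = nlp(text)
    return {
        "source": url,
        "entities": [{"name": ent.text, "type": ent.label_} for ent in doc.ents],
    }

page = "<html><body><h1>Diffbot</h1><p>Diffbot is based in Menlo Park, California.</p></body></html>"
print(extract_record(page, "https://example.com/about"))
```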

~~~
moxious
Can you expand on the major ways this will make the content different (for
example, Wikipedia is curated and pages for non-notable people get thrown out,
so if you're reading all of the web, presumably you'd know about non-notable
people) -- and why it's better?

~~~
miket
Founder here. There are many differences in the result when an automated
system builds a Knowledge Graph instead of humans.

The obvious one is scale: Wikipedia has on the order of 10M entities and
represents the work of thousands of humans, whereas the Diffbot KG has 10B
entities, is discovering about 120M more each day, and is largely limited only
by the number of machines running the algorithms in the datacenter. The
properties and facts indexed about each entity are also a superset, because
they are not limited to those that would be worthwhile for a human to curate.
Lastly, it can be more accurate than the facts found in any single source,
because the automated system uses the multiple occurrences of a fact found
across the web to estimate the probability that the fact is accurate.
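
Roughly speaking -- a simplified sketch, not the actual scoring we use -- each
independent source that asserts a fact acts like a noisy witness whose vote is
weighted by an estimate of that source's reliability:

```python
# Simplified sketch (not Diffbot's actual scoring): combine independent,
# noisy "witnesses" of a fact into a single confidence estimate via a
# naive-Bayes-style odds update.
def fact_confidence(observations, prior=0.5):
    """observations: list of (source_reliability, asserts_fact) pairs."""
    odds = prior / (1 - prior)
    for reliability, asserts in observations:
        likelihood = reliability if asserts else (1 - reliability)
        odds *= likelihood / (1 - likelihood)
    return odds / (1 + odds)

# Three sites agree on a CEO's name, one disagrees:
obs = [(0.9, True), (0.8, True), (0.7, True), (0.6, False)]
print(round(fact_confidence(obs), 3))  # ~0.98: high confidence despite one dissenter
```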

The result is a Knowledge Graph that is more useful for work and business,
because it covers the entities you interact with day to day, not just the
"head" entities that optimize for popularity and the constraints of human
curation.

------
subhobroto
Mike,

fantastic work here! As someone who's really excited about a machine-readable
web and has been working on it, this is fantastic! Unfortunately, while the
Semantic Web was meant to tackle this, its real-life proliferation has been,
at least to me personally, extremely disappointing.

So this is a great initiative, and one I'm personally glad to know about.

Is there a plan to expose this data via a dev API of some sort for enthusiasts
like us?

Say a SPARQL or even (Open)Graph API perhaps?

My experience consulting for and working with companies interested in this
domain has been that monetizing this data is extremely hard, both legally and
quality-wise.

"Nike Tanjun near me" is a query fraught with danger. People typing this query
want to find a retailer in their vicinity that sells this Nike product, but
where do we source that inventory list from and how do we get our cut?

Before people start talking about DSPs and SSPs: the problem at hand here is
very different.

To know that Nike Tanjun is a shoe sold by Nike, an ontology needs to exist
that captures this knowledge so that the user's query can be decoded.
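
What I mean by "decoded", as a toy sketch with made-up ontology entries:

```python
# Toy ontology (made-up entries) and a crude decoder turning a query like
# "Nike Tanjun near me" into a structured shopping intent.
ONTOLOGY = {
    "nike tanjun": {"type": "shoe", "brand": "Nike"},
    "nike downshifter": {"type": "shoe", "brand": "Nike"},
}

def decode(query):
    q = query.lower()
    intent = {"local": "near me" in q}
    for name, props in ONTOLOGY.items():
        if name in q:
            intent.update({"product": name, **props})
    return intent

print(decode("Nike Tanjun near me"))
# {'local': True, 'product': 'nike tanjun', 'type': 'shoe', 'brand': 'Nike'}
```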

How will that ontology be sourced? Further, for it to be usable commercially,
Nike has to agree to that encoding. Therein begin the challenges. If we
mistakenly encoded the Nike Downshifter as the Tanjun, and a user bought a
pair based on our results and disliked them, expecting the Downshifters to be
like Tanjuns, we would have an issue, and Nike could pursue the matter because
we misled the customer and affected their branding.

My primary clients are search companies or companies that want to provide rich
search functionality: Google, Bing or even DDG do a phenomenal job in this
space and the barrier to entry is pretty high.

So, knowledge quality, maintenance, versioning, and temporal resolution ("the
President of the U.S." and "the iPhone" are different entities over time)
aside, is Diffbot going to monetize this knowledge only as a B2B offering/add-
on for its clients, or are there other "bigger" plans to monetize this
tremendous undertaking and keep it rolling in the future?
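
(To clarify what I mean by temporal resolution, a rough sketch with
illustrative validity spans attached to each fact:)

```python
# Rough illustration of temporal resolution: the same name resolves to
# different entities depending on the date being asked about.
from datetime import date

# (name, entity, valid_from, valid_to) -- illustrative spans
FACTS = [
    ("President of the U.S.", "Barack Obama", date(2009, 1, 20), date(2017, 1, 20)),
    ("President of the U.S.", "Donald Trump", date(2017, 1, 20), None),
]

def resolve(name, on):
    for n, entity, start, end in FACTS:
        if n == name and start <= on and (end is None or on < end):
            return entity
    return None

print(resolve("President of the U.S.", date(2015, 6, 1)))   # Barack Obama
print(resolve("President of the U.S.", date(2018, 8, 30)))  # Donald Trump
```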

------
johnymontana
What technologies is this built on? Something like the Neo4j graph database?

~~~
dwynings
We tried Neo4j, but it couldn't support the throughput we needed for injecting
facts.

~~~
subhobroto
Upvoted. Have you given ArangoDB a shot?

~~~
miket
We've tried Neo4j, ArangoDB, as well as many others to store the triples.
Neo4j locked up at around 100M entities, and the loading/injection times also
weren't sufficient to build the KG at a regular interval. However, we are
closely following developments in these projects as they improve. There is
more detail in this interview: [https://www.zdnet.com/article/the-web-as-a-
database-the-bigg...](https://www.zdnet.com/article/the-web-as-a-database-the-
biggest-knowledge-graph-ever/)
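
For context on what "loading the triples" involves, a rough, store-agnostic
sketch (illustrative only, not our actual loader) of batching facts for bulk
ingestion on a regular rebuild:

```python
# Illustrative only: facts as (subject, predicate, object) triples,
# grouped into fixed-size batches for bulk loading into whatever store.
from itertools import islice

def batches(triples, size=10_000):
    it = iter(triples)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

facts = (("entity:%d" % i, "employer", "diffbot.com") for i in range(25_000))
for i, chunk in enumerate(batches(facts)):
    # a real loader would hand each chunk to the graph store's bulk-import API
    print("batch", i, "->", len(chunk), "triples")
```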

