

DeepDive – System that enables developers to analyze data on a deeper level - DyslexicAtheist
http://deepdive.stanford.edu/

======
rstoner
This could be a huge value-add to the groups that have invested heavily in
human-directed knowledge graph construction (e.g. Project Halo/Aristo at the
Allen Institute for AI).

------
chapulin
It's also being used to aid paleontology research:
[http://fusion.net/story/30751/paleo-deep-dive-machine-learning/](http://fusion.net/story/30751/paleo-deep-dive-machine-learning/)

------
polskibus
I'm mostly interested in how much it differs from what IBM Watson does. Does
IBM rely only on probabilistic inference, or does it do other data mining as
well?

~~~
nl
It's (very) roughly comparable to parts of it.

Firstly: IBM is increasingly using the Watson brand for things that don't
appear directly related to the Jeopardy-winning system (e.g., Watson
Analytics). When I talk about Watson here, I mean the question answering (QA)
system.

At a very high level, DeepDive consists of a Knowledge Graph (KG) construction
tool and a probabilistic querying tool. Compared to Watson, it is missing a
natural-language question parsing tool, and any way of dealing with questions
whose answers aren't in the KG.

Watson has (very strong) natural language understanding for multi-clause
questions, and the Jeopardy version can do things like understand puns.
DeepDive doesn't have anything comparable. In the open source space, the
closest thing I'm aware of is SEMPRE[1][2].

Watson also has an evidence scoring module, and my understanding is that this
can work against unstructured data. DeepDive doesn't have this, and instead
relies on probabilistic inference. This is an excellent approach, but it relies
on doing content extraction first (i.e., extracting entities and relationships
from text and/or other sources). The Microsoft Probase[3] group has published
a lot in this area.
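
To make that concrete, the extract-then-infer pattern looks roughly like the
toy Python sketch below. The patterns, confidences, and noisy-or combination
rule are all invented for illustration; this is not DeepDive's actual API.

```python
# Toy sketch of "extract content first, then infer" (invented rules/numbers).
import re

def extract_mentions(sentence):
    """Toy extractor: emit candidate (subject, relation, object, confidence)
    tuples, with confidence set by which textual pattern matched."""
    m = re.search(r"(\w+) works at (\w+)", sentence)
    if m:
        return [(m.group(1), "employed_by", m.group(2), 0.9)]
    m = re.search(r"(\w+), of (\w+),", sentence)
    if m:
        return [(m.group(1), "employed_by", m.group(2), 0.6)]  # weaker cue
    return []

def combine(confidences):
    """Noisy-or: the fact holds unless every piece of evidence is wrong."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= 1.0 - c
    return 1.0 - p_all_wrong

corpus = ["Alice works at Acme.", "Alice, of Acme, said hello."]
facts = {}
for sentence in corpus:
    for subj, rel, obj, conf in extract_mentions(sentence):
        facts.setdefault((subj, rel, obj), []).append(conf)

for triple, confs in facts.items():
    print(triple, round(combine(confs), 3))
# -> ('Alice', 'employed_by', 'Acme') 0.96
```

The point is the shape of the pipeline: extraction produces uncertain facts,
and inference aggregates the evidence rather than treating any single
extraction as ground truth.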

[1] [http://www-nlp.stanford.edu/joberant/homepage_files/publications/ACL14.pdf](http://www-nlp.stanford.edu/joberant/homepage_files/publications/ACL14.pdf)

[2]
[https://github.com/percyliang/sempre](https://github.com/percyliang/sempre)

[3] [http://research.microsoft.com/en-us/projects/probase/](http://research.microsoft.com/en-us/projects/probase/)

------
phreeza
I am wondering what a ballpark figure would be for how long it would take to
set up an instance of this for a given scientific field. Days? Months? Years?
I fear it is probably the latter.

~~~
batbomb
I've sat in on Chris' class at Stanford.

I think the answer is probably closer to weeks to months if working with field
experts, depending on how deep you want to go.

The core of it is open source.

I think the most exciting thing about it is that it brings more sophisticated
computation to the more qualitative sciences.

~~~
tlmr
What about for just a smallish, single-machine corpus of documents (~1,000)?

~~~
nl
The size of the corpus isn't the issue (apart from processing time of course).

The key issue in estimating how big a job it is comes down to how complex your
entity extraction and inference rulesets are.
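
To give a feel for what an "inference ruleset" means here, below is a toy
weighted-rule example in the spirit of Markov logic. The facts, the weight,
and the brute-force enumeration are all invented for illustration; real
systems use far more rules and approximate inference.

```python
# Toy weighted-rule inference: enumerate all possible worlds and score them.
import itertools, math

# Extracted (uncertain) evidence: prior belief that each fact is true.
evidence = {"smokes(Bob)": 0.8, "friends(Bob,Carol)": 0.9}

# One weighted rule: smokers' friends tend to smoke. A positive weight makes
# worlds that satisfy the rule more probable, without forcing it to hold.
RULE_WEIGHT = 1.5

def log_score(world):
    """Unnormalized log-probability of one truth assignment."""
    s = 0.0
    for fact, p in evidence.items():
        s += math.log(p if world[fact] else 1.0 - p)
    # rule: smokes(Bob) & friends(Bob,Carol) => smokes(Carol)
    body = world["smokes(Bob)"] and world["friends(Bob,Carol)"]
    if (not body) or world["smokes(Carol)"]:
        s += RULE_WEIGHT
    return s

names = ["smokes(Bob)", "friends(Bob,Carol)", "smokes(Carol)"]
worlds = [dict(zip(names, bits))
          for bits in itertools.product([False, True], repeat=3)]
z = sum(math.exp(log_score(w)) for w in worlds)
p = sum(math.exp(log_score(w)) for w in worlds if w["smokes(Carol)"]) / z
print(f"P(smokes(Carol)) = {p:.3f}")  # ~0.694: the rule pulls it above 0.5
```

Each new rule interacts with all the others, which is why the size of the
ruleset, not the size of the corpus, dominates the effort.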

------
nl
It does probabilistic inference![1]

So many open source "Knowledge Graph"-type projects concentrate on building
them like databases, with a query language that assumes the data in them is
correct. You see this in things like Freebase, DBpedia, and Wikidata, which
typically end up in a triple store that you query using SPARQL.

This isn't how the real world works, and there isn't a lot of publicly
available software that takes this into account. There aren't even that many
papers about it (the Microsoft Probase paper is one, and there is work from
the University of Florida(?) on using Markov chains to reason while taking
probabilities into account).
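
A toy contrast of the two query styles (the facts and numbers are invented):

```python
# A triple store treats every stored fact as simply true:
triples = {("acme", "hq_in", "nyc"), ("acme", "hq_in", "boston")}
print([o for (s, r, o) in triples if s == "acme" and r == "hq_in"])
# -> both answers come back, with nothing to say one is far more credible

# A probabilistic KG attaches a marginal probability to each fact and can
# rank answers by belief instead of pretending they are all certain:
prob_triples = {("acme", "hq_in", "nyc"): 0.95,
                ("acme", "hq_in", "boston"): 0.05}
answers = sorted(((p, o) for (s, r, o), p in prob_triples.items()
                  if s == "acme" and r == "hq_in"), reverse=True)
for p, o in answers:
    print(f"{o}: {p:.2f}")
```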

I'm excited to take a look at this.

[1]
[http://deepdive.stanford.edu/doc/general/inference.html](http://deepdive.stanford.edu/doc/general/inference.html)

~~~
anonetal
Aside from the work on probabilistic inference, there are also many papers on
"probabilistic databases" from the last 10 years (Chris did his PhD on that
topic). That work has looked at SQL-style query processing over
"uncertain"/"probabilistic" data; a toy sketch of the idea follows the
project links below.

These were some of the major projects:
[https://homes.cs.washington.edu/~suciu/project-mystiq.html](https://homes.cs.washington.edu/~suciu/project-mystiq.html),
[http://maybms.sourceforge.net/](http://maybms.sourceforge.net/),
[http://infolab.stanford.edu/trio/](http://infolab.stanford.edu/trio/),
[http://www.cs.umd.edu/~amol/PrDB/](http://www.cs.umd.edu/~amol/PrDB/),
[http://dl.acm.org/citation.cfm?id=1376686](http://dl.acm.org/citation.cfm?id=1376686).
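
To make the idea concrete, here is a toy possible-worlds sketch of the
semantics these systems implement. The schema and numbers are invented, and
real systems avoid this brute-force enumeration with safe/efficient query
plans.

```python
# Tuple-independent probabilistic database: every tuple exists independently
# with some probability; the DB is a distribution over ordinary databases
# ("possible worlds"); a boolean query's answer probability is the total
# weight of the worlds where it holds.
import itertools

R = [("alice", "nyc", 0.9), ("bob", "nyc", 0.5)]   # R(person, city)
S = [("nyc", 0.8)]                                  # S(city)

def query(r_rows, s_rows):
    """Q: EXISTS (SELECT * FROM R, S WHERE R.city = S.city)"""
    cities = {c for (c,) in s_rows}
    return any(city in cities for (_, city) in r_rows)

tables = R + S
answer_prob = 0.0
# Enumerate all possible worlds (exponential: fine for a toy, and exactly
# why real probabilistic DBs work hard to find efficient query plans).
for mask in itertools.product([False, True], repeat=len(tables)):
    weight = 1.0
    for present, row in zip(mask, tables):
        weight *= row[-1] if present else 1.0 - row[-1]
    r_world = [(n, c) for keep, (n, c, _) in zip(mask[:len(R)], R) if keep]
    s_world = [(c,) for keep, (c, _) in zip(mask[len(R):], S) if keep]
    if query(r_world, s_world):
        answer_prob += weight
print(f"P(Q) = {answer_prob:.3f}")  # 0.8 * (1 - 0.1*0.5) = 0.760
```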

~~~
nl
This is a fair point. In a similar vein there is also BayesDB
[http://probcomp.csail.mit.edu/bayesdb/](http://probcomp.csail.mit.edu/bayesdb/)

~~~
peterlvilim
An HDFS-oriented one (with SQL-style queries):

[https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf](https://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf)

[https://github.com/sameeragarwal/blinkdb](https://github.com/sameeragarwal/blinkdb)

~~~
nl
Is it?

I thought BlinkDB was data warehousing on HDFS? I don't see any mention of
inference-like features in the docs.

~~~
anonetal
It's not similar. BlinkDB builds on the work on sampling-driven approximate
SQL query processing (an early project in that space was AQUA at Bell Labs)
and extends it to the cloud/HDFS setting.

Although some terms come up in both places (e.g., confidence bounds, noise,
etc.), BlinkDB and probabilistic databases are fundamentally different from
each other (I have worked on both topics).
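
A toy sketch of the sampling-driven idea (synthetic data and a plain CLT
interval; BlinkDB's actual stratified-sampling machinery is far more
involved):

```python
# Answer an aggregate from a small sample and report a confidence interval,
# trading accuracy for speed.
import random, statistics

random.seed(0)
table = [random.expovariate(1 / 50.0) for _ in range(1_000_000)]  # full data

# "SELECT AVG(x) FROM table" answered from a 1% uniform sample
sample = random.sample(table, 10_000)
est = statistics.fmean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5
print(f"AVG ~= {est:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
print(f"exact AVG = {statistics.fmean(table):.2f}")
```

The difference in a nutshell: here the data is exact and the uncertainty
comes from sampling it, whereas in a probabilistic database the uncertainty
lives in the data itself.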

