
Graph Databases Intrigue Me - jlank
http://blog.jlank.com/2011/02/26/graph-databases-intrigue-me/
======
joe_the_user
Graph databases intrigue me too.

The thing is, it took me a little while to realize that the "semantic web" is
a very specific model, one where providers are expected to _explicitly_
provide the semantic decoration/metadata for all their content.
<http://en.wikipedia.org/wiki/Semantic_Web>

I basically don't believe that this particular approach will ever work (i.e.,
the flood-gates won't open and content providers won't suddenly label all
their data). This approach has been the failed model of hypertext since
Project Xanadu in the mid-sixties (a well-tended, fully meaningful store of
data).

Instead, Google and other search engines and tools will just get smarter.

We'll find more ways to incidentally extract semantic information from the raw
data that's out there. But no one will have enough incentive to _manually_
provide that much deep meaning for their data _themselves_ (and anything whose
semantic meaning can be automatically processed can just be put on the web for
_someone else_ to process automatically). The semantic web approach is always
going to be behind the curve compared to just putting raw, unstructured data
out there.

The more uses we find for information, the more ability we'll get to extract
meaning from it without the data starting out labeled.

Look at what Watson could do.

\-- And I am working on a tool that extracts implicit information from the
process of people interacting with data. Extracting implicit, inferred, and
deduced relations has much more promise, even if it can't rely on explicit
semantic labels. This is more or less what Google does too (it's true that
_so far_, Google's stuff is considered "semantically meaningless", and I know
Google bought Metaweb. We'll see what they get from it...)

~~~
Homunculiheaded
I've tried many times over the years to get excited about the semantic web,
but I invariably arrive at the same conclusion as you. I also think that
semweb suffers badly from over-engineering: it's a solution in search of a
problem. If you have a problem that semantic technologies seem like they could
help with, you can probably solve it faster using logic programming and
automated reasoning techniques from the 90s (ask the vast majority of semantic
web enthusiasts how RDF triples can be represented as Prolog clauses and
you'll likely be treated as though you asked how a raven is like a writing
desk). Worse yet are tools like OWL, which is over-engineered to the point
where, out of the box, it describes logic in a way that is computationally
intractable. If you were designing semantic technologies to solve a real
problem, you would never arrive at something like OWL.
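
To make the Prolog-clauses point concrete, here's a toy sketch (all names and
data invented for illustration; this is not how any real semweb stack works):
RDF-style triples held as plain tuples, queried with the transitive rule a
Prolog program would write as `reachable(X,Z) :- edge(X,Z).` /
`reachable(X,Z) :- edge(X,Y), reachable(Y,Z).`

```python
# Toy triple store: RDF-style (subject, predicate, object) tuples.
triples = {
    ("berlin", "locatedIn", "germany"),
    ("germany", "locatedIn", "europe"),
    ("paris", "locatedIn", "france"),
    ("france", "locatedIn", "europe"),
}

def located_in(place):
    """Everything `place` is transitively locatedIn (fixpoint iteration)."""
    found = set()
    frontier = {place}
    while frontier:
        # one step of the transitive rule: follow locatedIn from the frontier
        frontier = {o for (s, p, o) in triples
                    if p == "locatedIn" and s in frontier} - found
        found |= frontier
    return found
```

Twenty lines of ordinary code, no triple-store infrastructure required, which
is roughly the argument above.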

I actually think something like semantic technologies could be more useful
than the tools that drive Watson and Google for smaller data sets, where
machine/statistical learning is less useful, but even then they are an
over-engineered solution.

------
colanderman
I'm having trouble understanding how exactly graph databases differ from
relational databases, especially seeing as how graphs are isomorphic to
relations.

All the examples on this page look equivalent to how I'd model them in SQL:
<http://wiki.neo4j.org/content/Domain_Modeling_Gallery>

The best I gather is that graph databases are schemaless (big whoop, more room
for error), and that their implementations tend to perform well with
transitive closures. I'm not seeing anything that can't be solved with
materialized views in an RDBMS.

I'm tempted to think that just represents NoSQL folks coming to realize that
relations are actually a good thing. Soon they'll be talking about how
wonderful it is to take the Cartesian product of two graphs.

Someone care to enlighten me?

~~~
yummyfajitas
Suppose you have a graph with edge weights.

Query: find all the nodes a distance < D from some reference node.

With SQL92 or earlier, this will require an unbounded number of queries.
Proof: a SQL92 query can only traverse a bounded number of edges, since it has
only a finite number of joins (say K). Consider the graph 1->2->3->...->N with
edge weights 1/N. Traversing it requires on the order of N/K queries, and N is
arbitrary.
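
A runnable sketch of that fixed-join limit, assuming a toy chain table
(Python's sqlite3 standing in for a SQL92-style engine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE graph (parent INTEGER, child INTEGER)")
# chain graph 1 -> 2 -> ... -> 6
con.executemany("INSERT INTO graph VALUES (?, ?)",
                [(i, i + 1) for i in range(1, 6)])

# Three copies of the edge table joined together can only follow paths of
# exactly three edges; reaching any further requires rewriting the query
# with more joins.
rows = con.execute("""
    SELECT a.parent, c.child
    FROM graph a
    JOIN graph b ON b.parent = a.child
    JOIN graph c ON c.parent = b.child
""").fetchall()
```

On the 6-node chain this finds only the three length-3 paths (1->4, 2->5,
3->6), which is the bounded-traversal argument in miniature.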

It's probably doable in SQL99 and various vendor specific extensions (I think
it's Turing Complete), but it is unlikely to be fast.

So, as is the normal situation, you can do whatever you want in SQL. But the
performance is likely to suck.

(That said, I agree with you that many people are using NoSQL because it's the
hip new thing, when they should just use SQL. "Schemaless flexibility" is BS -
SQL is very flexible if you use an ORM + migration tool. The only flexibility
you lose is the flexibility to screw up the data in application logic.)

~~~
colanderman
Trivial with SQL-standard recursive queries:

    WITH RECURSIVE r AS (
        SELECT parent, child, distance FROM graph
      UNION ALL
        SELECT r.parent, graph.child, r.distance + graph.distance AS distance
        FROM graph, r
        WHERE graph.parent = r.child
    )
    SELECT parent, child FROM r WHERE distance < 3.14159;

Testing your graph with N=1000 on a teeny VPS: 477,897 rows returned in 6.1s.

If we move the WHERE clause inside the sub-SELECT (which unfortunately makes
the query slightly more awkward to use), selecting lower distances reduces the
compute time. For example, a bound of 0.5 returns 196,238 rows in 2.5s.
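
For anyone who wants to poke at this locally, here's a minimal runnable
version of the same query (SQLite via Python, tiny N, bound of 0.45 chosen
arbitrarily; numbers purely illustrative) with the distance filter pushed
inside the recursion as described:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE graph (parent INTEGER, child INTEGER, distance REAL)")
N = 10
# chain 1 -> 2 -> ... -> N with edge weight 1/N
con.executemany("INSERT INTO graph VALUES (?, ?, ?)",
                [(i, i + 1, 1.0 / N) for i in range(1, N)])

rows = con.execute("""
    WITH RECURSIVE r(parent, child, distance) AS (
        SELECT parent, child, distance FROM graph
        UNION ALL
        SELECT r.parent, graph.child, r.distance + graph.distance
        FROM graph, r
        WHERE graph.parent = r.child
          AND r.distance + graph.distance < 0.45  -- prune inside the recursion
    )
    SELECT parent, child FROM r WHERE distance < 0.45
""").fetchall()
```

With positive edge weights the inner bound is safe: it can only discard paths
the outer WHERE would have discarded anyway, which is why it cuts compute
time without changing the result.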

What sort of performance gains does a graph database give for this same query?

~~~
yummyfajitas
As I said, you can probably do it with a SQL99 query, it being Turing complete
and all.

I don't know how your database is implementing things under the hood, so I
don't know. I'll just list the major advantages of a graph database:

1) Edges stored with nodes. You can almost certainly read a node _and_ all
its edges with a single disk seek. This is unlikely to be the case for a
SQL system.

2) Compute time is always O(portion of graph traversed), not O(graph size),
and is always under the control of the user. In SQL, this will vary quite a
bit with implementation.

3) You can specify various heuristics for the traversal - depth first, breadth
first, etc. This is important if the specific nature of your data will affect
the speed of the query.
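
Point 2 is essentially a bounded shortest-path traversal; a sketch of the idea
over an in-memory adjacency list (this is the general technique, not any
particular graph DB's API):

```python
import heapq

def nodes_within(adj, source, max_dist):
    """Nodes at shortest-path distance < max_dist from source (Dijkstra
    with a cutoff).

    adj maps node -> [(neighbor, weight), ...], i.e. edges stored
    alongside their node, as a graph database would lay them out. The loop
    only touches the region it actually explores, so cost is
    O(portion traversed), not O(graph size).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, already found a shorter path
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Chain 1 -> 2 -> ... -> 1000 with unit weights: only the first few nodes
# are ever visited, no matter how long the chain is.
adj = {i: [(i + 1, 1)] for i in range(1, 1000)}
reach = nodes_within(adj, 1, 5)
```

The bound prunes exactly like the WHERE clause inside the recursive CTE above,
but here the traversal order (and hence the constant factors) is explicitly
under the user's control.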

~~~
colanderman
Ah thank you, this answers my original question. In response:

1) I contest that this is "unlikely" for SQL -- MySQL (unless I'm
misremembering) orders data on disk by its primary key, which would be an
equivalent optimization. (PostgreSQL does not do this to my knowledge.)

2) In my two queries above, I demonstrate both O(graph) and O(traversed) (in
that order). At least in PostgreSQL, this is very much under control of the
user as well, since WITH queries are evaluated in a specific order.

3) There is no way to specify such things in SQL. However most SQLs are
designed with the idea that the query planner is smarter than you (it almost
always is, at least in PostgreSQL), so you shouldn't be trying to specify
execution plans anyway.

Overall, I see no reason for graph databases & SQL to be _separate_. If graph
DBs really only address performance issues, as you claim, then one is throwing
out the baby with the bathwater by creating a new _kind_ of database rather
than simply adding specialized optimizations to e.g. PostgreSQL.

~~~
yummyfajitas
1) Ordering data on disk by primary key is irrelevant. You look up the node -
one seek. Look up the edges - another seek. Look up the nodes pointed to by
the edges - several more seeks.

In fact, looking up the edges is multiple seeks. For an edge (A, B), if the
primary key is the row itself, edges are only stored in sequence if you are
starting at node A.

Graph DBs usually cheat by storing edges on both nodes, if possible within the
same page as the node.
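
A sketch of that duplication trade-off (an in-memory stand-in, not a real
on-disk format): each edge is written twice, once per endpoint, so either
endpoint's record yields its neighbors without a second lookup.

```python
from collections import defaultdict

out_edges = defaultdict(list)  # edges leaving a node, stored with that node
in_edges = defaultdict(list)   # edges arriving at a node, stored with it too

def add_edge(a, b):
    out_edges[a].append(b)  # stored with node a
    in_edges[b].append(a)   # duplicated onto node b

add_edge("A", "B")
add_edge("C", "B")
```

The cost is double the write volume and edge storage; the payoff is that a
traversal arriving at B sees all of B's neighbors in B's own record.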

2) I'll take your word for it.

3) I generally agree that the query planner is smarter, but I'd be very
surprised if that is true for complicated recursive queries. In particular, I
really doubt that a query planner can exploit the structure of your data to
improve things.

I.e., I really doubt the planner will know whether a DFS or BFS is better, or
come up with heuristics to select the optimal path to go down (this is heavily
dependent on data). Graph databases tend to provide such functionality.

I absolutely agree with you that better graph support for a SQL DB would be
ideal.

------
Tichy
One thing I wonder is how far graph databases actually take you. I take the
"thing" of a graph database to be giving me all links with a single access
(that is, all links of a node are in a single data set). In a classical SQL
setup, on the other hand, links would probably be modelled as a table with one
row per link.

But still, even if I can get all links of a node with one access, it seems to
me memory will be too scarce in most cases. Let's say I want to consider all
nodes that are three steps away, and every node has 100 outgoing links. That's
already 100^3 nodes to consider. And at the second step there are 100^2 nodes,
so we would have to access the db 10,000 times just to get all the nodes of
the third step.

What I am getting at: it seems to me that for really interesting graph
computations it will usually still be necessary to create some
specialized/compressed model to fit all the data into main memory.
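
The fan-out arithmetic above, spelled out:

```python
fanout = 100                # outgoing links per node (assumed in the comment)
frontier_2 = fanout ** 2    # nodes on the frontier after two steps
frontier_3 = fanout ** 3    # candidate nodes at step three
db_accesses = frontier_2    # one per-node access to expand the second frontier
```

So even with one access per node, expanding a single extra step multiplies the
work by the fan-out, which is the memory/access pressure being described.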

------
Prasannav
I hope graph databases intrigue me sometime in the near future. I have been
playing around with the Freebase API for some time, and I must admit that
Freebase does a fairly decent job of recognizing some of the entities that
they promise to identify. Location is one such entity. They use a mix of NLP
and a DB of locations to identify location entities. They of course have a
semantic relationship (city -> country -> continent, for example) between
these entities that can offer you more insight. However, things like the
"context" of free-flowing text need to mature. They provide something called
the Social Tag for any text that you paste, but sometimes it is too generic,
sometimes it is far from the right context, and many times there are no social
tags at all. So we had to kind of move away from relying on Freebase and
figure out our own ways. I agree with Joe's comment that extracting semantic
info has to get smarter, and that relying on webmasters to provide this data
is certainly not going to be scalable or achievable in the near future.
Google's acquisition of Freebase did come as a surprise to me (considering its
current capability), but their promise of providing a weekly dump meant there
was good news (not for long, as we did not continue relying on Freebase
anyway).

Looking at what Watson could do made me wonder why the technology world is so
far behind w.r.t. interpreting information. If NLP is a solved problem (as it
looks to be in Watson's case), is all that's left creating linkage between
real-world entities? Freebase has linked 20 million, but that's not enough,
and the approach is the non-profit, "good for the world" kind. Can there be an
incentive for people to provide links between entities? Can we build a profit
model where people/organizations compete to provide more and better linkages?
Or can we extract such links from people's web activity (search, social
networking, etc. [FB, Twitter])... I am getting too many ideas now :D:D ...
and of course, can we request IBM to donate their NLP technology for the
greater good :P :P

------
aristidb
Graph databases are probably also nice for a lot of "non-semantic" stuff.

For example, I recently wondered about the best representation for deep
hierarchical data, and graph databases appear to be a good fit.

------
olihb
I'm curious about the performance and scalability of those databases. Right
now at work we have an MSSQL server with 32 CPUs, SSDs, and 128GB of RAM, and
I was wondering if a graph database would be better.

Our dataset is very peculiar: we have about 20M nodes and 500M edges in one of
our smallest databases. We also need full-text search and complex joins
between tables.

I'm wondering if graph databases are mature enough or if they are not quite
ready yet for production work. We're a small shop and software development is
not our core competency.

~~~
thesz
<http://markorodriguez.com/2011/02/18/mysql-vs-neo4j-on-a-large-scale-graph-traversal/>

Most graph databases use BerkeleyDB under the hood. They are pretty safe to
work with.

------
PaulHoule
Graph databases are fascinating, but as an engineer who cares about query
speed, indexing, and how things interact with memory hierarchies, I'm glad to
leave them to the computer "scientists", who can go ahead filling volume after
volume of conference proceedings with minimal impact on the practice of
computing.

~~~
colanderman
This is a pretty ignorant comment. Try reading those conference proceedings
sometime and you'll realize who's responsible for your fast query speed.

