There are difficult theoretical computer science problems that effectively limit the parallelization/distribution of generalized graph operators. To achieve high scalability you have to solve those problems first. If this design offers a novel solution to these longstanding problems then kudos, but nothing on the site suggests that is the case.
Many graph databases have claimed high scalability and distributability but none of those claims have held up over time due to the aforementioned computer science problems. This may be a very nice graph database but I am skeptical of the claims of "highly scalable, distributed" unless there is evidence that it uses fundamentally new theoretical computer science to achieve that.
The trick is not to build a generalized graph operator solution. It's to build a specialized graph operator solution, and then see how many problems you can fit onto those specialized graph operators. It turns out you can do a lot with a little.
For example, if you can reduce a trillion-edge graph analysis problem into a billion-edge graph plus some other stuff (usually materialized document structures) then you can fit that into something like Pregel. That is how almost all real-world graph analysis is done today.
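To make the Pregel point concrete, here's a toy, single-process sketch of the vertex-centric BSP model Pregel uses (connected components via min-label propagation). This is an invented illustration of the programming model, not Titan's or Pregel's actual code:

```python
# Toy Pregel-style vertex program: each superstep, active vertices send
# their label to neighbors; a vertex that sees a smaller label adopts it
# and stays active. Halts when no vertex changes.

def pregel_components(edges, num_vertices):
    # Build an undirected adjacency list.
    adj = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = {v: v for v in range(num_vertices)}  # per-vertex value
    active = set(range(num_vertices))            # vertices still computing

    while active:  # one iteration = one superstep
        inbox = {v: [] for v in range(num_vertices)}
        for v in active:                 # "compute" phase: send messages
            for w in adj[v]:
                inbox[w].append(label[v])
        active = set()
        for v, msgs in inbox.items():    # apply incoming messages
            if msgs and min(msgs) < label[v]:
                label[v] = min(msgs)
                active.add(v)            # changed, so stay active
    return label

# Two components: {0, 1, 2} and {3, 4}
print(pregel_components([(0, 1), (1, 2), (3, 4)], 5))
# -> {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Note what makes this tractable: each vertex only ever talks to its neighbors, so the work partitions cleanly. Ad hoc traversals don't have that property.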
But by doing so, you've lost the ability to do graph analysis on the other ~trillion edges for the sake of tractability in a narrow case. You can't do relationship analysis across the attached documents. There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.
The Holy Grail is still the ability to run ad hoc graph analytic queries against a massively distributed graph representation. There are no shortcuts around this for many interesting applications. Right now, we are limited to mere billions of edges for most practical purposes and all of the hacks and workarounds are designed to keep the number of true edges to around this number even when the data model is much larger.
"interesting" != "useful"
Again, the same thing has been true of distributed computing in general. MapReduce & Hadoop are pretty much the antithesis of where distributed computing research had been heading for the last ~20 years, because MapReduce solves what is nearly an "embarrassingly parallel" problem.
> There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.
There are even more distributed computing algorithms that don't fit into MapReduce terribly well (and really, it's not that those algorithms don't work with MapReduce/Pregel, it's that they don't work well), but it is still quite useful.
Turns out, the reason it's the Holy Grail is that it is just flat-out hard to do (provably so). While what Titan/Pregel do isn't nearly as difficult, it is surprisingly difficult to do at massive scale, so just doing the simple stuff they do is quite useful and game-changing.
This is essentially the same underlying problem that explains why distributed NoSQL databases do not support join operations. NoSQL databases can get away with dropping joins because a join is not one of their core operations. (Technically you can still do a join; it just has terrible scaling characteristics.)
The fundamental operation of a graph database is a relational join by another name, which means that graph databases hit the same limitation on distribution that joins impose on distributed NoSQL databases. However, unlike a NoSQL database, a graph database's primary operation is exactly this one, so it can't just not support it. Consequently, the only way to have a "graph database" that is massively distributable is to solve the same problem that prevents distributed databases from supporting joins.
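Here's a toy illustration of that equivalence (not any real database's implementation): a one-hop traversal is a join between the current frontier and the edge table, and when edges are hash-partitioned across shards, every hop scatters lookups across the cluster, which is exactly the access pattern that makes distributed joins expensive:

```python
# Invented sketch: edges hash-partitioned by source vertex across shards,
# with each frontier-vertex lookup standing in for a network round trip.

NUM_SHARDS = 4

def shard_of(vertex):
    return hash(vertex) % NUM_SHARDS

def partition_edges(edges):
    """Hash-partition an adjacency list by source vertex, KV-store style."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for src, dst in edges:
        shards[shard_of(src)].setdefault(src, []).append(dst)
    return shards

def one_hop(shards, frontier):
    """Join frontier x edges: one remote lookup per frontier vertex."""
    lookups = 0
    next_frontier = set()
    for v in frontier:
        lookups += 1  # in a real cluster, a cross-node round trip
        for w in shards[shard_of(v)].get(v, []):
            next_frontier.add(w)
    return next_frontier, lookups

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
shards = partition_edges(edges)
frontier, hops, total_lookups = {"a"}, 0, 0
while frontier:  # each level of the traversal is another distributed join
    frontier, n = one_hop(shards, frontier)
    hops += 1
    total_lookups += n
print(hops, total_lookups)  # -> 4 5
```

A multi-hop query is this loop run to arbitrary depth, with no way to know up front which shards the next hop will touch.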
There's a paper on 4store here: http://4store.org/publications/harris-ssws09.pdf
Solutions to the graph partitioning problem exist and among people doing high-end graph analytics this has been rumored for years now. It just is not published and people that know how it is done are slathered in NDAs. I know of two different (related) algorithms for parallelizing graph analysis. IBM Research currently has the most advanced algorithms for graph analysis and they disclose very little about how they work.
It can't do half of what Titan does and is a much, much simpler design. But it is fast, easy to use and works really well for 80% of the use cases you might have for a graph database on the web (social graphs, semantic web stuff, etc.)
OrientDB tries to scale writes (I'll be testing this in a few months), but still stores all of the data everywhere.
This looks like it shards the data automagically. If it works well, I might be able to bang on it a bit, but I'm guessing that it gives shit performance for complex graph questions.
Find all Nodes with a property in a tree
Find all leaves L of those nodes
Find all annotations in a DAG of those leaves
Collapse similar DAG entries by backtracking up the graph based on edge weights
Writes are bulk loaded, and right now, we are just trying to push all of the graph stuff offline, but there are some limitations to that, and we could really up our accuracy by being able to perform these queries quickly.
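Steps 1-3 of the workload above can be sketched against in-memory structures roughly like this (all names and data shapes are invented for illustration; step 4, the weight-based collapse, is elided because the merge heuristic is application-specific):

```python
# Hypothetical sketch of the query pipeline: find marked nodes in a tree,
# collect their leaves, then look up DAG annotations for those leaves.

def find_nodes_with_property(tree, prop):
    """Step 1: DFS the tree for nodes carrying a given property."""
    hits, stack = [], [tree]
    while stack:
        node = stack.pop()
        if prop in node.get("props", {}):
            hits.append(node)
        stack.extend(node.get("children", []))
    return hits

def leaves_of(node):
    """Step 2: collect the leaf descendants of a node."""
    if not node.get("children"):
        return [node]
    out = []
    for child in node["children"]:
        out.extend(leaves_of(child))
    return out

def annotations_of(leaves, dag):
    """Step 3: look up DAG annotations attached to each leaf id."""
    return {leaf["id"]: dag.get(leaf["id"], []) for leaf in leaves}

tree = {"id": "root", "children": [
    {"id": "a", "props": {"marked": True},
     "children": [{"id": "a1"}, {"id": "a2"}]},
    {"id": "b"},
]}
hits = find_nodes_with_property(tree, "marked")
lvs = leaves_of(hits[0])
ann = annotations_of(lvs, {"a1": ["annot-x"]})
```

Offline this is easy; the hard part the comment describes is getting the same answers interactively against the live graph.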
What is interesting about a graph database relative to a simple key-value database? Storing the edges of a graph is trivial for a key-value store, so it seems like any key-value store could store the basic graph structure.
Do graph databases support graph-specific queries and indices?
There are many, many people on HN who are much more knowledgeable than I am about graph DBs, and I sure as hell hope they answer this question.
I'm curious if this supports the RDF, OWL, and SPARQL standards?
I'm a little tired of graph DBs that focus on scale rather than speed and flexibility, though. A good one to check out is Stardog: http://stardog.com/ I think it just hit 1.0.
What triple stores have you looked at? 4store is performant, but doesn't support reasoning. There's also BigData and Virtuoso which support various levels of it, and Franz are apparently working on a clustered version of Allegrograph.
5.3.1. Co-Favorited Places - Users Who Like x Also Like y
Find places that people who favorite this place also like:
* Determine who has favorited place x.
* What else have they favorited that is not place x?
START place=node:node_auto_index(name = "CoffeeShop1")
MATCH place<-[:favorite]-person-[:favorite]->stuff
RETURN stuff.name, count(*)
ORDER BY count(*) DESC, stuff.name
PDF: http://docs.neo4j.org/pdf/neo4j-manual-milestone.pdf , online: http://docs.neo4j.org/chunked/milestone/
Now whether or not Titan implements these performantly, I have no idea.
Graph databases are great for datasets where structure matters more than in a relational database and (way) more than in a K/V-store. When you're doing traversals of arbitrary depth, other technologies fall on their collective face.
As for special indexes and queries, look at things like http://docs.neo4j.org/chunked/snapshot/cypher-query-lang.htm... for expressing patterns and walks in graphs.
You can combine index lookups in e.g. Redis, MongoDB or Lucene with the core graph engine.
See http://docs.neo4j.org/chunked/snapshot/index.html for docs.
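Roughly, the lookup-then-traverse split looks like this (a hypothetical sketch: a plain dict stands in for the external index a la Redis/Lucene, and an adjacency list stands in for the graph engine; all names are invented):

```python
# Index lookup resolves a property to entry vertices; the graph engine
# then walks edges from there.

graph = {  # adjacency list: vertex -> outgoing (relationship, target) edges
    "v1": [("KNOWS", "v2"), ("KNOWS", "v3")],
    "v2": [("KNOWS", "v3")],
    "v3": [],
}
name_index = {"alice": "v1", "bob": "v2", "carol": "v3"}  # external index

def friends_of(name, rel="KNOWS"):
    start = name_index[name]  # index lookup: property -> vertex id
    return [dst for r, dst in graph[start] if r == rel]  # graph traversal

print(friends_of("alice"))  # -> ['v2', 'v3']
```

The index only finds starting points; everything after that is edge-following, which is where the graph engine earns its keep.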
Does that mean that, for each vertex, its sub-graph is indexed?!