What makes this graph database "highly scalable, distributed"?
There are difficult theoretical computer science problems that effectively limit the parallelization/distribution of generalized graph operators. To achieve high scalability you have to solve these problems first. If this design offers a novel solution to these longstanding problems, then kudos, but nothing on the site suggests this is the case.
Many graph databases have claimed high scalability and distributability but none of those claims have held up over time due to the aforementioned computer science problems. This may be a very nice graph database but I am skeptical of the claims of "highly scalable, distributed" unless there is evidence that it uses fundamentally new theoretical computer science to achieve that.
You could have said the same thing about generalized distributed computing as well. Much like with MapReduce, the trick is to recognize that there is an 80% solution that works quite well (based around BSP):
The trick is not to build a generalized graph operator solution. It's to have a specialized graph operator solution, and then see how many solutions you can fit to the specialized graph operators. Turns out you can do a lot with a little.
Approaches like Pregel have been the canonical way of dealing with large graph problems for many years. Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.
For example, if you can reduce a trillion-edge graph analysis problem into a billion-edge graph plus some other stuff (usually materialized document structures) then you can fit that into something like Pregel. That is how almost all real-world graph analysis is done today.
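The vertex-centric BSP model described above can be sketched in plain Python. This is a toy single-machine illustration of the programming model ("think like a vertex": local compute, message passing, global barrier), not Pregel's or Titan's actual API; the graph data is made up for the example:

```python
# Toy single-machine sketch of the vertex-centric BSP model.
# Each superstep, every vertex with incoming messages runs a small
# compute step; a global barrier then swaps the message buffers.

def connected_components(edges):
    # adjacency lists for an undirected graph
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # every vertex starts labeled with its own id...
    label = {v: v for v in neighbors}
    # ...and superstep 0 sends that id to all neighbors
    messages = {}
    for v in neighbors:
        for n in neighbors[v]:
            messages.setdefault(n, []).append(v)

    while messages:  # run until no vertex has mail (global quiescence)
        outbox = {}
        for v, inbox in messages.items():
            smallest = min(inbox)
            if smallest < label[v]:        # adopt the smallest label seen
                label[v] = smallest
                for n in neighbors[v]:     # notify neighbors next superstep
                    outbox.setdefault(n, []).append(smallest)
        messages = outbox                  # synchronization barrier
    return label

labels = connected_components([(1, 2), (2, 3), (4, 5)])
# vertices 1-3 converge to one label, vertices 4-5 to another
```

The point is how little the per-vertex code has to know: no global state, just an inbox and an adjacency list. That restriction is exactly what makes the model distributable -- and what shuts out the algorithms that need global graph structure.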
But by doing so, you've lost the ability to do graph analysis on the other ~trillion edges for the sake of tractability in a narrow case. You can't do relationship analysis across the attached documents. There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.
The Holy Grail is still the ability to run ad hoc graph analytic queries against a massively distributed graph representation. There are no shortcuts around this for many interesting applications. Right now, we are limited to mere billions of edges for most practical purposes and all of the hacks and workarounds are designed to keep the number of true edges to around this number even when the data model is much larger.
> Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.
"interesting" != "useful"
Again, the same thing has been true of distributed computing in general. MapReduce & Hadoop are pretty much the antithesis of where distributed computing research had been heading for the last ~20 years, because MapReduce solves what is nearly an "embarrassingly parallel" problem.
> There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.
There are even more distributed computing algorithms that don't fit into MapReduce terribly well (and really, it's not that algorithms don't work with MapReduce/Pregel, it's that they don't work well), but it is still quite useful.
Turns out, the reason it's the Holy Grail is that it is just flat out hard to do (provably so). While what Titan/Pregel do isn't nearly as difficult, it is surprisingly difficult to do at massive scale, so just doing the simple stuff they do is quite useful and game-changing.
No, I am referring to the set of problems related to graph partitioning.
This is essentially the same underlying problem that explains why distributed NoSQL databases do not support join operations. NoSQL databases can get away with simply not supporting joins because a join is not one of their core operations. (Technically you can still do a join; it just has terrible scaling characteristics.)
The fundamental operation of a graph database is a relational join by another name, which means graph databases face the same limitation on distribution that distributed NoSQL databases face with joins. Unlike NoSQL databases, however, it is their primary operation, so they can't simply not support it. Consequently, the only way to have a "graph database" that is massively distributable is to solve the same problem that prevents distributed databases from supporting joins.
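The "traversal is a join" point can be made concrete. Suppose edges live in a plain key-value store mapping vertex id to adjacency list (a hypothetical sketch, not any particular database's API): every hop of a traversal is then a batch of key lookups -- a join of the frontier against the edge table -- and on a partitioned store each hop can fan out across every machine:

```python
# A graph in a key-value store: key = vertex id, value = neighbor ids.
# Each traversal hop is a "join": look up the adjacency list of every
# frontier vertex. On a partitioned KV store those lookups scatter
# across machines, which is the scaling problem described above.
kv = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

def hop(frontier):
    # one traversal step == one join of the frontier against the edge table
    out = set()
    for v in frontier:
        out.update(kv.get(v, []))
    return out

friends = hop({"alice"})           # one hop out from alice
friends_of_friends = hop(friends)  # a second hop, i.e. a second join
```

A k-hop traversal is k chained joins, and the frontier (and therefore the cross-partition traffic) can grow with every hop -- which is exactly why "just shard it" doesn't work.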
I've been working on the graph partitioning issues for distributed graph DBs for a while (my own pet project that my brain won't let me give up on), but from the perspective of someone who hasn't kept up with the CS literature. Have you got any suggestions for canonical papers from the last decade or so, covering the state of the art?
There has not been much that is both new and interesting in the graph partitioning literature in a long time; what you already know is probably not far off from the current state of the art. The literature on this topic has been stagnant for years.
Solutions to the graph partitioning problem exist and among people doing high-end graph analytics this has been rumored for years now. It just is not published and people that know how it is done are slathered in NDAs. I know of two different (related) algorithms for parallelizing graph analysis. IBM Research currently has the most advanced algorithms for graph analysis and they disclose very little about how they work.
Agreed. This is why our solution was to use Node.js and achieve parallelism through eventually consistent, independent data stores, one per physical web-server application instance. I'll gladly switch to a distributed, centrally managed system when one becomes available.
Very interesting. Since it implements Blueprints, does it also have support for Furnace (graph algorithms)? If so, does this imply that graph processing is done on disk rather than in-memory? I am rather unfamiliar with Blueprints but I'm wondering how Titan implements that aspect.
Titan exposes graph data over a machine cluster. It is an OLTP system that allows you to do local neighborhood graph traversals in sub-second time. For OLAP processing (e.g. global graph algorithms), Aurelius will be releasing two projects named Faunus and Fulgora in the coming months. These provide Hadoop connectivity and compressed in-memory representations of "graph slices." We will be publishing our talk slides tonight that discuss this ecosystem of graph technologies. See http://titanbiggraphdata.eventbrite.com/
I really wish that I could attend. We ask a lot of questions like:
Find all Nodes with a property in a tree
Find all leaves L of those nodes
Find all annotations in a DAG of those leaves
Collapse similar DAG entries by backtracking up the graph based on edge weights
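The first two of those queries could be sketched roughly as follows, over a toy in-memory tree (entirely hypothetical data and property names; the real workload would of course run inside the database, not in application code):

```python
# Toy in-memory tree to illustrate the first two queries above:
# find all nodes carrying a property, then find all leaves below them.
children = {
    "root": ["a", "b"],
    "a":    ["a1", "a2"],
    "b":    ["b1"],
    "a1": [], "a2": [], "b1": [],
}
props = {"a": {"flagged"}, "b1": {"flagged"}}  # hypothetical property sets

def nodes_with_property(prop):
    # "Find all nodes with a property in a tree"
    return {n for n in children if prop in props.get(n, set())}

def leaves_below(node):
    # "Find all leaves L of those nodes": walk down to childless nodes
    kids = children.get(node, [])
    if not kids:
        return {node}
    out = set()
    for k in kids:
        out |= leaves_below(k)
    return out

flagged = nodes_with_property("flagged")
leaves = set().union(*(leaves_below(n) for n in flagged))
```

The DAG-annotation and edge-weight-backtracking steps are the same shape -- filter a frontier, then traverse from it -- which is why pushing them into a traversal-oriented store is attractive once latency matters.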
Writes are bulk loaded, and right now, we are just trying to push all of the graph stuff offline, but there are some limitations to that, and we could really up our accuracy by being able to perform these queries quickly.
Also, Aurelius will be posting a blog post in ~2 weeks where we simulate Twitter. We replay Twitter from day 1 to June 2009 and slam it with ~10,000 concurrent users writing/reading follow relationships, tweets, and stream constructions -- ultimately growing (what we think will be) a 3 billion edge graph when the simulation is complete.
What is interesting about a graph database relative to a simple key-value database? Storing the edges of a graph is trivial for a key-value store, so it seems like any key-value store could let you store the basic graph structure.
Do graph databases support graph-specific queries and indices?
I can only answer for one advantage I specifically know of regarding graph DB's over key value:
dynamic, mergeable schemas which enforce data integrity WITHIN the database rather than with code on top of it.
There are many, many people on HN who are much more knowledgeable than I am about graph DBs, and I sure as hell hope they answer this question.
I'm curious if this supports the RDF, OWL, and SPARQL standards?
I'm a little tired of graph DBs that focus on scale rather than speed and flexibility, though. A good one to check out is Stardog. http://stardog.com/ I think it just hit 1.0.
What triple stores have you looked at? 4store is performant, but doesn't support reasoning. There's also BigData and Virtuoso which support various levels of it, and Franz are apparently working on a clustered version of Allegrograph.
Sure, you can easily store graphs in a KV store, but graph databases let you efficiently traverse the graph too. Look up how to traverse graph structures in Hadoop, for example -- I was looking it up recently for something I'm building, and in the end I opted to use Hadoop to filter, aggregate, and format the raw data, then feed the preprocessed data into graph databases (I'm using OrientDB at the moment) for traversal queries. I duplicate the data on multiple machines and use Storm bolts to do graph queries (so each Storm bolt has its own local copy of the graph database). Seems to be working well so far and allows me to do realtime traversals (though inserts wait for the next Hadoop batch job). It seemed a lot easier than what I'd have to do to process this efficiently using Hadoop alone.
What's interesting about key-value databases? Storing keys and values in an RDBMS is trivial...
Graph databases are great for datasets where structure matters more than in a relational database and (way) more than in a K/V-store. When you're doing traversals of arbitrary depth, other technologies fall on their collective face.
They seem to implement a variety of graph-centric APIs. From a query standpoint, it's easier to use a graph database to do something like find all the people who share interests with someone (person <--> interest <--> person), or find all the grandchildren in a family tree.
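The person <--> interest <--> person pattern is a two-hop traversal that a graph database expresses in one query; spelled out by hand over a toy in-memory edge list (made-up names and interests), it looks like this:

```python
# Toy sketch of the person <--> interest <--> person pattern.
# A graph database expresses this as a single two-hop traversal;
# here we spell out both hops by hand.
likes = {
    "ann":  {"jazz", "chess"},
    "bob":  {"jazz"},
    "cara": {"chess", "hiking"},
    "dan":  {"hiking"},
}

def shares_interest_with(person):
    # hop 1: person -> interests; hop 2: interests -> other people
    mine = likes[person]
    return {p for p, ints in likes.items() if p != person and ints & mine}

shares_interest_with("ann")  # people sharing at least one interest with ann
```

In a relational schema the same query is a self-join of the person-interest table on interest; the graph query language just makes the hop structure explicit.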
Now whether or not Titan implements these performantly, I have no idea.
Yes, all information models are basically isomorphic. However, Neo4j, for instance, maintains referential integrity along the relationships, making sure there will never be a relationship without a start or end node. In a K/V store or document store, there are no guarantees that IDs pointing to other objects are updated as you change, delete, or move the target data. Also, the graph structure is maintained on disk, meaning you can follow direct pointers at the storage level even in non-cached scenarios, while in a K/V or document store you need to recreate the graph in memory before being able to do anything with it.
I don't want to spam this thread, but I feel I have to plug my personal pet project Related (https://github.com/sutajio/related), since it is, well.. yeah... "related" to the topic being discussed.
It can't do half of what Titan does and is a much, much simpler design. But it is fast, easy to use and works really well for 80% of the use cases you might have for a graph database on the web (social graphs, semantic web stuff, etc.)