Hacker News new | comments | show | ask | jobs | submit login
Titan: A Highly Scalable, Distributed Graph Database (github.com)
170 points by d0ugal 2008 days ago | hide | past | web | favorite | 39 comments

What makes this graph database "highly scalable, distributed"?

There are difficult theoretical computer science problems that effectively limit the parallelization/distribution of generalized graph operators. To achieve high scalability you have to solve these computer science problems first. If this design offers a novel solution to the longstanding computer science problem then kudos, but nothing at the site suggests this is the case.

Many graph databases have claimed high scalability and distributability but none of those claims have held up over time due to the aforementioned computer science problems. This may be a very nice graph database but I am skeptical of the claims of "highly scalable, distributed" unless there is evidence that it uses fundamentally new theoretical computer science to achieve that.

You could have said the same thing about generalized distributed computing as well. Much like with MapReduce, the trick is to recognize that there is an 80% solution that works quite well (based around BSP):



The trick is not to build a generalized graph operator solution. It's to have a specialized graph operator solution, and then see how many solutions you can fit to the specialized graph operators. Turns out you can do a lot with a little.

Approaches like Pregel have been the canonical way of dealing with large graph problems for many years. Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.

For example, if you can reduce a trillion-edge graph analysis problem into a billion-edge graph plus some other stuff (usually materialized document structures) then you can fit that into something like Pregel. That is how almost all real-world graph analysis is done today.

But by doing so, you've lost the ability to do graph analysis on the other ~trillion edges for the sake of tractability in a narrow case. You can't do relationship analysis across the attached documents. There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.

The Holy Grail is still the ability to run ad hoc graph analytic queries against a massively distributed graph representation. There are no shortcuts around this for many interesting applications. Right now, we are limited to mere billions of edges for most practical purposes and all of the hacks and workarounds are designed to keep the number of true edges to around this number even when the data model is much larger.

> Unfortunately, most interesting graph analytic problems do not fit into that model because the "graph-like" aspect is still limited to problem sizes that fit conventional algorithms.

"interesting" != "useful"

Again, the same thing has been true with distributed computing in general. MapReduce & Hadoop are pretty much the antithesis of where distributed computing research for the last ~20 years had been working, because MapReduce solves what is nearly an "embarrassingly parallel" problem.

> There are many, many graph analytic problems that require a true graph that is orders of magnitude larger than what can be partitioned even after accounting for graph reduction techniques such as those used in Pregel.

There are even more distributed computing algorithms that don't fit in to MapReduce terribly well (and really, it's not that algorithms don't work with MapReduce/Pregel, it's that they don't work well), but it is still quite useful.

Turns out, the reason it's the Holy Grail is that it is just flat out hard to do (provably so). While what Titan/Pregel do isn't nearly is difficult, it is surprisingly difficult to do at massive scale, so just doing the simple stuff they do is quite useful and game changing.

Agreed, this is why our solution was to use Node.JS and achieve parallelism through eventually consistent independent data-stores separated on each physical web-server application instance, I'll gladly switch to a distributed centrally managed system when it becomes available.

You mean CAP theorem? http://en.wikipedia.org/wiki/CAP_theorem I imagine it is either the A or the P that gets to be the victim, but I'm not sure which in this case.

No, I am referring to the set of problems related to graph partitioning.

This is essentially the same underlying problem that is the source of why distributed NoSQL databases do not support join operations. In the case of NoSQL databases, they simply do not support joins because it is not a core operations. (Technically you can still do a join, it just has terrible scaling characteristics.)

The fundamental operation of graph databases are relational joins by another name, which means that graph databases have the same limitation on distribution that distributed NoSQL databases have on joins. However, unlike NoSQL databases it is their primary operation so they can't just not support it. Consequently, the only way to have a "graph database" that is massively distributable is to solve the same problem that prevents distributed databases from supporting joins.

There are several distributed triple stores (triple stores are a specialised graph store).

http://4store.org http://virtuoso.openlinksw.com/ http://www.systap.com/bigdata.htm

There's a paper on 4store here: http://4store.org/publications/harris-ssws09.pdf

I've been working on the graph partitioning issues for distirbuted graphdbs for a while (my own pet project that my brain won't let me give up on) but it's been from the perspective of someone who's not been keeping up to date with the Comp Sci literature. Have you got any suggestions for canonical papers from the last decade or so, for the state of the art?

There has not been much that is both new and interesting in the graph partitioning literature in a long time. What you already know is probably not too far off from what is in the current literature. The literature on this topic has been stagnant for years.

Solutions to the graph partitioning problem exist and among people doing high-end graph analytics this has been rumored for years now. It just is not published and people that know how it is done are slathered in NDAs. I know of two different (related) algorithms for parallelizing graph analysis. IBM Research currently has the most advanced algorithms for graph analysis and they disclose very little about how they work.

Thank you, this is really great to show AP or CP but never CA and call it distributed.

Thank you, this shows that for the one you want, the choice is in the backend.

We are going to address alot of these questions in our presentation tonight. However, for the Hacker News crew that won't be there tonight, here is an early release of the Titan talk.



Enjoy!, Marko.

Great presentation.

Also it appears like a great way to take speakerdeck down...

Very interesting. Since it implements Blueprints, does it also have support for Furnace (graph algorithms)? If so, does this imply that graph processing is done on disk rather than in-memory? I am rather unfamiliar with Blueprints but I'm wondering how Titan implements that aspect.

I don't want to spam this thread, but I feel I have to plug my personal pet project Related (https://github.com/sutajio/related), since it is, well.. yeah... "related" to the topic being discussed.

It can't do half of what Titan does and is a much, much simpler design. But it is fast, easy to use and works really well for 80% of the use cases you might have for a graph database on the web (social graphs, semantic web stuff, etc.)

Interesting, does anyone know how this compares with the popular Neo4j Database?

Each Neo4j node stores all of the data, and it doesn't scale write horizontally well.

OrientDB tries to scale writes (I'll be testing this in a few months), but still stores all of the data everywhere.

This looks like it shards the data automagically. If it works well, I might be able to bang on it a bit, but I'm guessing that it gives shit performance for complex graph questions.

Titan exposes graph data over a machine cluster. It is an OLTP system that allows you to do local neighborhood graph traversals in sub-second time. For OLAP processing (e.g. global graph algorithms), Aurelius will be releasing two projects named Faunus and Fulgora in the coming months. These provide Hadoop connectivity and compressed in-memory representations of "graph slices." We will be publishing our talk slides tonight that discuss this eco-system of graph technologies. See http://titanbiggraphdata.eventbrite.com/

I really wish that I could attend. We ask a lot of questions like:

Find all Nodes with a property in a tree

Find all leaves L of those nodes

Find all annotations in a DAG of those leaves

Collapse similar DAG entries by backtracking up the graph based on edge weights

Writes are bulk loaded, and right now, we are just trying to push all of the graph stuff offline, but there are some limitations to that, and we could really up our accuracy by being able to perform these queries quickly.

Congrats on the release, Marko!

Also, Aurelius will be posting a blog post in ~2 weeks where we will be simulating Twitter. We replay Twitter from day 1 to June 2009 and slam it with ~10,000 concurrent users writing/reading follows relationships, tweets, stream constructions -- ultimately growing (what we think will be) a 3 billion edge graph when the simulation is complete.


What is interesting about a graph database relative to simple key-value database? Storing edges of a graph is trivial for a key-value store and so it seems like any key-value store could let store the basic graph structure?

Do graph databases support graphic-specific queries and indices?

I can only answer for one advantage I specifically know of regarding graph DB's over key value: dynamic, mergeable schemas which enforce data integrity WITHIN the database rather than with code on top of it.

There are many, many people on HN who are much more knowledgeable than I am on graph DB's, and I sure as hell hope they answer on this question.

I'm curious if this supports the RDF, OWL, and SPARQL standards?

I'm a little tired of graph DB's that focus on scale, rather than speed and flexibility though. A good one to check out is Stardog. http://stardog.com/ I think it just went into 1.0.

It looks like it's Gremlin only.

What triple stores have you looked at? 4store is performant, but doesn't support reasoning. There's also BigData and Virtuoso which support various levels of it, and Franz are apparently working on a clustered version of Allegrograph.

Sure you can easily store graphs in a KV store, but graph databases allow you to efficiently traverse the graph too. Look up how to traverse graph structures in hadoop for example - I was looking it up recently for something Im building and in the end I opted for using hadoop to filter, aggregate and format the raw data and then input the preprocessed data into graph databases (Im using orientdb at the moment) for traversal queries. I duplicate the data on multiple machines and use storm bolts to do geaph queries (so each storm bolt has its own local copy of the graph database). Seems to be working well so far and allows me to do realtime traversals (though inserts wait hadoop the next hadoop batch job). Seemed a lot easier than what Id have to do to efficiently process using hadoop.

Read Neo4j's documentation [1](well, don't read; just gloss over it). You'll be blown away by how much it can do. I recommend looking at chapter 5. An example, with the actual query:


5.3.1. Co-Favorited Places - Users Who Like x Also Like y

Find places that people also like who favorite this place:

* Determine who has favorited place x.

* What else have they favorited that is not place x.


    START place=node:node_auto_index(name = "CoffeeShop1")
    MATCH place<-[:favorite]-person-[:favorite]->stuff
    RETURN stuff.name, count(*)
    ORDER BY count(*) DESC, stuff.name

[1]: PDF: http://docs.neo4j.org/pdf/neo4j-manual-milestone.pdf , online: http://docs.neo4j.org/chunked/milestone/

They seem to implement a variety of graph centric APIs. From a query standpoint, it's easier to use a graph database to do something like find all the people who share interests with someone (person <--> interest <--> person), or find all the grand children in a family tree.

Now whether or not Titan implements these performantly, I have no idea.

What's interesting about key-value databases? Storing keys and values in an RDBMS is trivial...

Graph databases are great for datasets where structure matters more than in a relational database and (way) more than in a K/V-store. When you're doing traversals of arbitrary depth, other technologies fall on their collective face.

Yes, all information models are basically isomorphic. However, Neo4j for instance maintains referential integrity along the relationships, making sure there will never be a relationships without start- or endnode. In a K/V store or document store, there are no guarantees that your IDs pointing to other objects are updated as you change, delete or move the target data. Also, the graph structure is maintained on dics, meaning you can direct pointers on the storage level even in non-cached scenarios, while in a K/V or Document store you need to recreate the graph in memory before being able to do anything with it.

As for special indexes and queries, look at things like http://docs.neo4j.org/chunked/snapshot/cypher-query-lang.htm... expressing patterns and walks in graphs.

You can combine index lookups in e.g. Redis, MongoDB or Lucene with the core graph engine.

See http://docs.neo4j.org/chunked/snapshot/index.html for docs.


Storing graphs in a key/value store is trivial. Querying them is not. Gremlin is much more powerful than anything you can home bake.

Where's the SPARQL support?

Titan currently does not have "edge indexing" and thus can not implement Blueprints' GraphSail interface [https://github.com/tinkerpop/blueprints/wiki/Sail-Ouplementa...]. If it did implement GraphSail, then SPARQL would be supported via the Sail SPARQL engine.

"Vertex-centric indices provide vertex-level querying to alleviate issues with the infamous super node problem"

Does that mean, for each vertex, is it's sub-graph indexed?!

Anyone tried to use this with Spring?

Why the downvote? Spring Data Neo4j [1] works excellent, it's pretty popular and I'm interested has anyone tried working with TITAN in a Spring project(probably using the Blueprints API).

[1] http://www.springsource.org/spring-data/neo4j

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact