

The Distributed Graph Database Titan Provides Real-Time Big Graph Data - okram
http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/

======
jandrewrogers
This looks like a very nice implementation of a graph database. However a
6-machine cluster barely qualifies as "distributed" for the purposes of a
graph database. You will experience almost no sub-linearity at this scale no
matter how poorly the algorithms distribute. I am not intimately familiar with
the transactional implementation but the graph processing algorithms described
in Titan are not significantly distributable.

For graph queries, linearity can definitely be achieved on parallel systems
with _tens of thousands_ of compute nodes using the state-of-the-art
algorithms. However, Titan does not use any of those algorithms and will be
very lucky to exhibit linearity at a couple dozen nodes. Not to take away from
the implementation but people looking for a scalable graph database will not
be satisfied with Titan.

As an aside, the best algorithms and data structures for doing massively
parallel graph operations are not published but obviously exist. The fastest
system in the Graph500 benchmark uses a half million cores on a trillion edge
graph. That is a gap of several orders of magnitude between what the best open
source systems can do and what systems developed by closed research
organizations can do, as published in public benchmarks.

(Disclosure: I invented one of the aforementioned families of massively
parallel graph processing algorithms in 2009, and mine was not the first. The published
literature has not even caught up with the first generation of such
algorithms. A significant portion of advanced computer science research is no
longer published.)

~~~
mbroecheler
I think you are looking at a very different use case here. The systems that I
think you are referring to analyze a static graph representation. The Graph500
benchmark in particular loads one big static, unlabeled, undirected, property-
free graph and then runs extensive (BFS) analysis algorithms on it. The fact
that the graph is not changing allows significant investment into building
locality optimizing data structures (which is essentially what space
decomposition is all about).

Titan on the other hand is a transactional database system to handle large,
multi-relational (labeled) graphs with heterogeneous properties. A Titan graph
is constantly evolving (as in the posted benchmark). For graphs (unlike geo-
spatial domains), applying space decomposition techniques first requires a
metric space embedding which is a non-trivial and computationally expensive
process. For changing graphs, this embedding will change as well, making this
very difficult to use in practice. The best approaches I know of for achieving
locality therefore use adaptive graph partitioning techniques instead.
However, for the types of OLTP workloads that Titan is optimized for, this
would be overkill in the sense that the time spent on partitioning will likely
exceed the time saved at runtime. At very large scale, it is most important
for OLTP systems to focus on access path optimization based on the ACTUAL
query load experienced by the system and not some perceived sense of locality
based on connectedness. I published a paper a while ago suggesting one
approach to do so: [http://www.knowledgefrominformation.com/2010/08/01/cosi-
clou...](http://www.knowledgefrominformation.com/2010/08/01/cosi-cloud-
oriented-subgraph-identification-in-massive-social-networks/) The Graph500
benchmark explicitly prohibits this optimization ("The first kernel constructs
an undirected graph in a format usable by all subsequent kernels. No
subsequent modifications are permitted to benefit specific kernels").

~~~
jandrewrogers
Good comment.

I was using Graph500 as a decently documented public example more than the
only example. There are other problems based on real-world data in the
trillion edge range that serve as "hello world" models for testing massively
parallel graph algorithms: directed and undirected, cyclic and acyclic, with
properties and property-free. Semantic databases and entity analytics are
popular test cases.

In the specific case of Graph500, the graph is significantly cyclic which
creates coordination issues if you simply denormalize the data (e.g.
replicating edges around a graph cut). Being able to do a massively parallel
BFS from any vertex in the graph and producing the globally correct result
_without_ replicating edges means that you cannot know how to optimize the
organization ahead of time. This was an intentional part of the benchmark
design. Graph500 does not lend itself to optimizing for a particular set
of adaptive graph cuts in any conventional sense; the algorithms used need to
be general over the 64 randomized runs and that benchmark was designed to
favor non-replicated edges when using massively parallel systems (the
coordination costs of edge replicas will kill your performance). However,
obviously the massively parallel systems are partitioning up the graph in some
sense.
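The no-replicated-edges point above can be illustrated with a toy level-synchronous BFS over a vertex-partitioned graph. This is a hypothetical single-process simulation, not any real Graph500 implementation; the naive hash partitioner (`owner`) and all names are assumptions for illustration:

```python
# Toy single-process simulation of a level-synchronous BFS over a
# vertex-partitioned graph: each partition owns a disjoint set of vertices
# and only the adjacency lists of those vertices; newly discovered remote
# vertices are exchanged between levels instead of duplicating edge data.
from collections import defaultdict

def partitioned_bfs(edges, num_parts, source):
    owner = lambda v: v % num_parts  # naive hash partitioner (an assumption)
    # Each undirected edge is stored once per direction, at the owner of
    # its source endpoint -- no edge is replicated across partitions.
    adj = [defaultdict(list) for _ in range(num_parts)]
    for u, v in edges:
        adj[owner(u)][u].append(v)
        adj[owner(v)][v].append(u)
    dist = {source: 0}
    frontier = [set() for _ in range(num_parts)]
    frontier[owner(source)].add(source)
    level = 0
    while any(frontier):
        # Each partition expands its local frontier; discoveries are routed
        # ("exchanged") to the owning partition for the next level.
        outbox = [set() for _ in range(num_parts)]
        for p in range(num_parts):
            for u in frontier[p]:
                for v in adj[p][u]:
                    if v not in dist:
                        dist[v] = level + 1
                        outbox[owner(v)].add(v)
        frontier = outbox
        level += 1
    return dist
```

The per-level frontier exchange is the coordination cost that makes edge replicas so painful at scale: every replica would need the same dedup traffic anyway.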

In the specific case of the work I did a couple years ago, the systems can
ingest tens of millions of new edges per second concurrent with parallel
queries (not serializable multi-statement transactions, obviously). The ingest
structure can be identical to the structure against which ad hoc queries are
run without any kind of query optimization. The fact that ingest rates that
high are sustained effectively precludes dynamically reorganizing the data to
satisfy particular queries more optimally. In truth, it could be made more
optimal for batch-y type workloads (maybe 2-3x faster versus the dynamic
version?) but the point was to be able to throw massive amounts of hardware at
arbitrary graph data models rather than optimizing it for a specific query.

BTW, metric space embedding is non-trivial algorithmically but can also be
computationally inexpensive. The Macbook Air I am using now can do tens of
millions of those embedding operations per second on a single core for
moderately complex spaces and data models. Maybe an order of magnitude or two
slower if dealing with complex, non-Euclidean geometries. However, I also
spent a couple years of computer science R&D developing the algorithms to make
that fast. :-) I have been working in this particular area for a bit over half
a decade now so my perspective takes some things for granted I think. There
isn't just one problem you have to solve, there are actually several if you
are starting from scratch.

Like I said, I didn't want to take anything away from Titan, and true
OLTP-oriented systems have their own complex problems, not the least of which is
that they don't scale too far beyond a couple hundred compute nodes for the
current state-of-the-art. Not my specialty. I work in a world of more basic
consistency guarantees.

Cheers!

------
jasonwatkinspdx
Great article.

I'd like to see a few other details that aren't mentioned:

\- What does the following distribution end up looking like? Does it have a
similar fraction of 'celebrity' users with huge follower counts? Or more
technically, does the Russian roulette against the recommendation sampling end
up producing a network similar to a scale free graph grown via preferential
attachment (Barabási–Albert model)? It looks like your mean fanout is about
20, which is smaller than what twitter has published, but I'd be more
interested in knowing how many 10k+ follower users are in the graph.

\- What's the write amplification like? ~1.6 billion tweet-per-follower edges
stored daily seems like it could burn a lot of capacity quickly, though
most of it will grow cold quickly and could be pushed to archive. Making a
rough guess from your disk write monitoring line graph, it looks like you'd be
putting down about 16GB a day? It'd be interesting to see a comparison between
this run and one done where streams are built indirectly via follower links
alone.
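The Barabási–Albert question above — whether the sampling ends up producing a few huge-fanout 'celebrities' — can be sketched with a minimal preferential-attachment toy. The parameters and names here are illustrative assumptions, not the post's actual generator:

```python
# Minimal Barabasi-Albert preferential-attachment sketch: each new user
# follows m existing users chosen with probability proportional to their
# current follower count, which yields a heavy tail of "celebrity" accounts.
import random

def preferential_attachment(n, m, seed=42):
    rng = random.Random(seed)
    followers = [0] * n
    targets = list(range(m))  # sampling pool; users appear once per follow
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:          # pick m distinct followees
            chosen.add(rng.choice(targets))
        for t in chosen:
            followers[t] += 1
            targets.append(t)           # popular users get picked more often
        targets.append(new)             # newcomer enters the pool once
    return followers
```

With mean fanout m around 20 (as observed in the post), the resulting follower counts are heavy-tailed: a handful of early users accumulate far more followers than the mean, which is exactly the 10k+ follower question being asked.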

~~~
mbroecheler
Hey,

\- the data we used was crawled by Kwak et al. in 2009. We wanted to use a
real social network dataset for the experiment and that was the largest/most
useful one we could find. Other than de-duplication we did not make any
modifications to the dataset, so the statistics reported in their paper still
hold:
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153....](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.6476)

\- You mean what is the overhead induced by pre-computing the stream edge
rather than collecting the relevant streams at query time? You are right that
this requires a significant amount of storage space; however, as you also
pointed out, this will get cold quickly and be sitting on disk only (i.e. not
taking up space in the valuable cache). The reason this is very efficient is
because of the time-based vertex-centric index we build for the stream edges.
This allows us to quickly pull out the most recent tweets for any user. If we
had to compute those at query time, we would have to traverse to each person
followed, get their 10 most recent tweets and then merge those in-memory. That
would be significantly more expensive and since stream reading is probably the
most frequent activity on Twitter, pre-computing it saves a lot of time at the
expense of inexpensive disk storage.
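The trade-off described above — merging followees' tweets at query time versus reading a pre-computed, time-sorted stream — can be sketched in a few lines. The data layout below is an assumption for illustration, not Titan's actual vertex-centric index:

```python
# Sketch of the stream trade-off: query-time merge across followees vs. a
# pre-computed time-sorted stream (write-amplified, but reads become a
# cheap prefix scan). Data shapes are hypothetical, not Titan's model.
import heapq

# tweets[user] holds that user's tweets, newest first: (timestamp, text)
tweets = {
    "alice": [(105, "a3"), (101, "a2"), (100, "a1")],
    "bob":   [(104, "b2"), (102, "b1")],
    "carol": [(103, "c1")],
}

def stream_at_query_time(followees, k):
    # Traverse to each followee, pull recent tweets, merge in memory --
    # the expensive path when a user follows many people.
    merged = heapq.merge(*(tweets[u] for u in followees), reverse=True)
    return [text for _, text in list(merged)[:k]]

# Pre-computed alternative: on every tweet, append it to each follower's
# stream, kept sorted by time (akin to a time-ordered vertex-centric
# index). Here we just build the sorted stream once for illustration.
precomputed_stream = sorted(
    (t for u in ("alice", "bob", "carol") for t in tweets[u]),
    reverse=True)

def stream_precomputed(k):
    return [text for _, text in precomputed_stream[:k]]
```

Both paths return the same stream; the pre-computed one trades cheap disk space at write time for avoiding the fan-in traversal on every read.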

------
espeed
Right now Titan traversals run locally and communicate with the back-end
Cassandra or HBase datastore via Thrift, but they are working on moving the
traversal code inside the Cassandra/HBase JVM so you can run Gremlin
traversals inside the distributed cluster rather than pulling the
elements over the wire.

See [https://groups.google.com/d/msg/gremlin-
users/6GYiHn3QR8g/81...](https://groups.google.com/d/msg/gremlin-
users/6GYiHn3QR8g/81TFiW7yoTsJ)

------
erichocean
Cassandra is turning out to be one killer data backend. It's really exciting
to see what's being built on top of it.

Hadoop, Solr/Lucene, and now Blueprints/Graph DB operations are all available
on the same Cassandra cluster, in addition to the stuff Cassandra does quote-
unquote natively. Add Zookeeper for the few times you need an honest-to-
goodness transaction and it's just crazy how good the tech has gotten on the
backend in the last 10 years. :-)

~~~
mbroecheler
Absolutely, without NoSQL solutions like Cassandra, Titan would not be
possible.

Regarding Zookeeper: We actually built a locking system into Titan that uses
quorum reads/writes with time-outs and cleanup to ensure consistency for
certain edge/property types as defined inside Titan. This gives you
consistency guarantees out of the box without having to introduce another
component (like Zookeeper) into your deployment. For infrequent lock usage
(which I strongly encourage ;-) this should be sufficient. For frequent
locking, something like Zookeeper is far superior.
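The quorum intuition behind that locking scheme can be shown in a toy model: with N replicas, choosing read/write quorum sizes R and W such that R + W > N forces every read set to overlap every write set, so the highest-timestamped claim is always visible without a coordinator like Zookeeper. This is only the textbook building block, not Titan's actual implementation:

```python
# Toy model of quorum reads/writes (R + W > N): the read set always
# overlaps the write set, so the newest timestamped claim wins without a
# separate coordinator. Hypothetical sketch, not Titan's code.
N = 5
replicas = [dict() for _ in range(N)]  # each replica stores one claim

def quorum_write(w, ts, value):
    # Write the claim to w replicas (here simply the first w); a replica
    # keeps only the newest timestamp it has seen.
    for node in replicas[:w]:
        if ts > node.get("ts", -1):
            node["ts"], node["val"] = ts, value

def quorum_read(r):
    # Deliberately read a *different* set (the last r replicas): because
    # r + w > N, this set still overlaps the write set, so at least one
    # replica holds the latest claim.
    claims = [n for n in replicas[-r:] if "ts" in n]
    return max(claims, key=lambda n: n["ts"])["val"]
```

With N = 5 and W = R = 3, any write set of three replicas shares at least one node with any read set of three, which is what makes the timeout-and-cleanup scheme workable for infrequent locks.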

------
amirmansour
I'm liking the Apache license, which pulls me away from Neo4j. Can anyone
please lay out a comparison between Neo4j and Titan?

~~~
espeed
Titan is distributed and has much higher write performance (5000
transactions/sec). In the words of Matthias Broecheler, the creator of Titan,
you would use Titan to implement a social network
(<https://twitter.com/MBroecheler/statuses/213350753031569409>).

Titan has pluggable storage options: Cassandra, HBase, Berkeley DB, and some
have been working on adapters for DynamoDB and App Engine Datastore. However,
it's not clear how well App Engine and DynamoDB will work with the upcoming
Titan changes which will move the traversal engine into the cluster's JVM.

Both Neo4j and Titan support Blueprints, and you can run Gremlin
(<https://github.com/tinkerpop/gremlin/wiki>) on both. Marko Rodriguez
(<http://markorodriguez.com/>), the creator of Gremlin, is one of the
principals behind Titan.

See "The Benefits of Titan": [https://github.com/thinkaurelius/titan/wiki/The-
Benefits-of-...](https://github.com/thinkaurelius/titan/wiki/The-Benefits-of-
Titan)

~~~
espeed
UPDATE: The benchmark in this post shows write performance on standard EC2
machines to be:

"Titan's performance under normal (5,900 transactions per second) and peak
(10,200 transactions per second) load..."

