Cayley – An open-source graph database (cayley.io)
289 points by _pius 1270 days ago | 89 comments

For people here experienced with graph databases, do you typically use the graph db as your primary data store or do you use it in combination with something like postgresql? If you're using both, can you talk about how that works and if it's been successful for you?

I'm curious because I've had a couple situations where I thought using neo4j (or some graph db) would be a natural fit for something I wanted to do, but otherwise I thought most of my other data fit into postgresql just fine. My instinct is that if I'm doing this in a web app then querying from two different databases is going to slow down my responses a lot.

I have actually started with using neo4j as a primary store and then moved to using it as a secondary store next to postgres. I wrote some stuff up here: http://nambrot.com/posts/14-credport-technical-post-mortem/

It worked out that way because neo4j was only useful for parts of our queries, and fitting everything into neo4j was just a hassle when most of our data was relational.

From your post: "The biggest issue was that we had data in the graph, that just didn’t feel right in the graph instead of a relational DB." -- What exactly was the problem other than managing complexity? I read through your posts and I didn't see any mention of the technical aspects of your issues with Neo4j. Was your data just so large that going the relational table route gave you a better understanding of this complexity?

I had the same history with neo4j.

Actually you can have a graph database in postgres as well! Look into queries using "WITH RECURSIVE", and you can do pretty much anything a graph database could do. From the specific use case I had, there was actually no difference in performance between neo4j and postgres. I really enjoyed using cypher, and it was a pain to translate a query written in a graph-specific query language to a postgres equivalent with "WITH RECURSIVE", but because postgres was already a part of the stack I stuck with it.
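For anyone who hasn't tried it, the shape of such a query is easy to experiment with in SQLite, which accepts essentially the same WITH RECURSIVE syntax as Postgres and ships in the Python stdlib. The table and data here are made up for illustration:

```python
import sqlite3

# Toy directed graph stored relationally; the recursive CTE walks it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('x','y');
""")

# All nodes reachable from 'a'. Using UNION (not UNION ALL) deduplicates
# rows, which also keeps the recursion from looping on cycles.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable;
""").fetchall()

print(sorted(n for (n,) in rows))  # ['a', 'b', 'c', 'd']
```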

How do you cope with cycles in the graph? Do you even have cycles?

Depending on the data model there are a few ways to deal with cycles. My project involved a friendship graph, and doing queries like "find friends of X", or "find people who like X, who are within 2 degrees of friendship-separation from Y". These are problems where it was okay to have cycles in the graph, as the traversal depth was hard limited. You'll have more problems if you're asking questions like "find the cheapest path from A to B", although there is certainly a way to cope with cycles there as well.
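The depth-limiting trick that makes cycles safe can be sketched like this (hypothetical schema; SQLite for convenience, since Postgres accepts essentially the same SQL):

```python
import sqlite3

# Carrying a depth column in the recursive CTE lets the recursion stop at
# 2 hops even though the friendship graph below contains cycles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (a TEXT, b TEXT);
    -- Undirected friendships stored as two directed rows, with a cycle x-y-z-x.
    INSERT INTO friends VALUES
        ('x','y'), ('y','x'),
        ('y','z'), ('z','y'),
        ('z','x'), ('x','z'),
        ('z','w'), ('w','z');
""")

# Everyone within 2 degrees of friendship-separation from 'x'.
rows = conn.execute("""
    WITH RECURSIVE within(person, depth) AS (
        SELECT 'x', 0
        UNION
        SELECT f.b, w.depth + 1
        FROM friends f JOIN within w ON f.a = w.person
        WHERE w.depth < 2
    )
    SELECT DISTINCT person FROM within WHERE person != 'x';
""").fetchall()

print(sorted(p for (p,) in rows))  # ['w', 'y', 'z']
```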

I've got a lot of background in RDF graph stores. It depends a lot on your usage, but I think for your typical web app, you'd be better off using a Postgres install, and making use of fancier features like WITH RECURSIVE as necessary. Graph stores often miss out on features like guaranteed relational integrity and guaranteed constraints, which I find invaluable for safe application development in the face of concurrent updates.

Graph stores are typically much slower for repetitive data that fits cleanly into a relational model. This isn't to say they're not useful - for more irregular data they're a fantastic fit - it's just that very irregularly structured data isn't the common case.

Of course, you can always use two different stores - much like many sites do with a separate lucene/elasticsearch index for text search - but your graphing needs must be relatively componentised for that to work well.

Curious: What RDF triple stores have you used, and in what kind of application?

I was looking into using Stardog for a metadata repository I was building, but we ended up (probably unwisely) bastardizing Postgres into a bunch of self-join hierarchies.

The ones I've spent most time with were Jena/TDB, Virtuoso, 3store, along with a couple of proprietary engines. BigOWLIM is also a strong contender in the space. I've used them in the context of both object storage and semantic web data storage.

My experience is that if you don't need constraints/enforced relational integrity, RDF stores make for really simple/easy object storage. There's definitely a performance tradeoff, though - depends on what you need, really!

SQLite added CTE (WITH RECURSIVE) recently - http://www.sqlite.org/lang_with.html

Which ORMs support WITH RECURSIVE?

You don't really need any special ORM support -- even if you are using an ORM -- if you use appropriate views.

Hmm. I guess what I'm getting at is that I typically start building a web app by writing model objects that get turned into db schemas by the ORM (in Python, this is usually the Django ORM or SQLAlchemy), and the ORM turns attribute access into joins (either eagerly or lazily).

So an ORM that usefully interpreted model subclassing etc. and created self-joining tables and could query the resulting model using WITH RECURSIVE as appropriate would be a real boon.
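A minimal sketch of the views idea the grandparent mentioned, using SQLite for convenience (the `categories` schema and `category_paths` view are made up, and the actual ORM mapping is omitted):

```python
import sqlite3

# Hide the recursive CTE behind a view, so an ORM can map the view like
# any ordinary table/model instead of needing WITH RECURSIVE support.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO categories VALUES
        (1, NULL, 'root'), (2, 1, 'books'), (3, 2, 'scifi');

    -- The view flattens each category with its path from the root.
    CREATE VIEW category_paths AS
    WITH RECURSIVE tree(id, path) AS (
        SELECT id, name FROM categories WHERE parent_id IS NULL
        UNION ALL
        SELECT c.id, t.path || '/' || c.name
        FROM categories c JOIN tree t ON c.parent_id = t.id
    )
    SELECT id, path FROM tree;
""")

# An ORM would map category_paths as a read-only model; plain SQL works too.
rows = dict(conn.execute("SELECT id, path FROM category_paths"))
print(rows[3])  # root/books/scifi
```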

I don't currently use graph databases in production, but I do have some experience.

I use both, in a similar vein to "using Elasticsearch". The graph database could be your primary store, but sometimes it's more pragmatic to have two stores, with a "solid" base.

This is not to say that it can't be done. What I'm stressing is that larger "changes" are difficult to handle - which matters a lot at the start of your process, when you're deciding how to model your data, and less at the end. For instance node layout (new properties? different type? other constraints?), and mass updates are also a bit cumbersome.

Usually I have more than one SQL table (naturally) since the data I've used in graph databases is mix and match (otherwise I'd just use a fixed schema and some relational DB).

-- As for "how that works", for me it's:

routinely updating from my base database with queries like: ID > last ID.

This has worked as expected, in terms of what data you get in, and which limitations you impose (e.g. timeliness).

I'm currently making a shift to running all data in my graph database as I've settled on a model (which edges, which nodes, which properties).

> querying from two different databases is going to slow down my responses

True, but depending on your data (do you know one of the queries beforehand - e.g. is your postgresql query enriching whatever your graph query returns) you might have success tying (inserting) some of the SQL data to your graph database.

A graph can do what a table can do and a lot more, but that's usually not the whole issue. In practice you need to consider things like speed, volume, scale, consistency, redundancy, computation, ad-hoc vs. planned operations, use of resources (disk, memory, CPU, GPU), etc. And as most NoSQL systems just aren't as mature as their table-based counterparts, you'll also have to factor in your tolerance for issues and general system crankiness. All that being said, some applications just cry out for graphs, particularly apps that involve items linked in pairs. Social apps (people linked by friendships), travel (places linked by flights), communications (people linked by messages), all of these can play hell with an SQL database but are naturals for graph databases.

I agree with the idea that tables are just strict graphs and as such a graph database is usually capable of substituting a relational database. I think many graph DBs lack a sophisticated enough query language to bridge that gap. At Orly (https://github.com/orlyatomics/orly) we're working on a powerful query language, and it's nice to see that Cayley is doing the same.

> querying from two different databases is going to slow down my responses

I think querying 2 different systems tends to be slower, but more importantly you lose transactionality. If you can use a single system that is at least on-par with your relational system for your run of the mill data and have a very powerful graph then that's a big win.

I work at a graph-focused firm, XN Logic, where we use an unhydrated graph to store and analyze the relations, an appropriate store for large volumes of information, and Datomic to store mutations to the graph for history analysis.

We use the PACER engine (https://github.com/pangloss/pacer ) to power queries.

This approach allows you to get optimal performance, reaching out to other systems only when needed.

I first started with neo4j as a primary data store for our semantic graph but there are some limitations that are forcing us to look for alternatives.

1. Adding edges to a neo4j graph is a painfully slow process. For a large graph with a few million nodes, it'll take days.

2. Scaling neo4j on a cluster is either not possible or a painful process; I've yet to find out which.

However, the greatest advantage that neo4j offers is the ability to query a path. So far, no other graph database that I know of has this ability (including Apache Spark and Giraph).

It's quite possible to build a directed graph database as an adjacency list in redis. We tried this and it's super fast and scalable. However, querying is very painful.
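The adjacency-list idea can be sketched without a Redis server by mocking the sets with plain Python (the `out:`/`in:` key layout here is one possible scheme, not necessarily what the parent used):

```python
from collections import defaultdict

# In Redis, each node's edges would live in a set under a key like
# "out:<node>": SADD to add an edge, SMEMBERS/SINTER to query.
out_edges = defaultdict(set)   # stands in for Redis keys "out:<node>"
in_edges = defaultdict(set)    # stands in for Redis keys "in:<node>"

def add_edge(src, dst):
    out_edges[src].add(dst)    # Redis: SADD out:<src> <dst>
    in_edges[dst].add(src)     # Redis: SADD in:<dst> <src>

add_edge('a', 'b')
add_edge('a', 'c')
add_edge('b', 'c')

# One-hop lookups are trivial and fast...
one_hop = sorted(out_edges['a'])          # ['b', 'c']
# ...and set intersection gives cheap joins (common successors of a and b):
common = out_edges['a'] & out_edges['b']  # {'c'}
print(one_hop, common)
# Anything multi-hop has to be hand-rolled client-side, which is where
# "querying is very painful" comes in.
```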

1. Adding nodes and relationships in Neo4j does not have to be slow. It really depends on how you are loading that data in. Neo4j provides many options for data import and a transactional endpoint over HTTP for batching transactions and decreasing disk write overhead.

2. The reason Neo4j is the only database that allows you to query a path is the same reason that setting up clustering or sharding is difficult. If your graph is complex then the problem is "How do I split up these subgraphs into shards so that traversals don't have to traverse across shards?" -- Building a giant adjacency list and using that as a traversal index is a clever idea, I must admit. :)

Openlink Virtuoso (and any other RDF store that supports them) does with the Property Paths feature of SPARQL 1.1.

Right, but you have to use SPARQL, and from my limited experience with SPARQL it's not very fast either.

As someone else said, very much dependent on database engine. Some are faster than others, some scale better than others - it's about picking what's right for your requirements.

You should try again, depending on the software (i.e. which SPARQL database) it can be much faster than neo4j.

If you query the two databases in parallel, then the response time should be equal to the slower of the two database responses (not the sum of them). But if you use two databases then you have to maintain both of them, and if they are on the same server then sharing the same resources could make them slower than just using one DB.
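The parallel case can be sketched like this (the two query functions are stand-ins for real Postgres and graph-db calls):

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Issuing both queries concurrently makes total latency approach
# max(t1, t2) rather than t1 + t2.
def query_relational():
    time.sleep(0.2)            # pretend Postgres round trip
    return {'user': 'alice'}

def query_graph():
    time.sleep(0.2)            # pretend graph-db round trip
    return {'friends': ['bob', 'carol']}

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    rel = pool.submit(query_relational)
    graph = pool.submit(query_graph)
    result = {**rel.result(), **graph.result()}
elapsed = time.monotonic() - start

print(result)
# elapsed is close to 0.2s (the slower call), not 0.4s (the sum)
```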

We use an RDF datastore (OpenLink Virtuoso, clustered edition) as our primary datastore. We use it in combination with Apache Solr to provide fulltext search over various resources that we extract and pass through an Indexing pipeline to go from RDF Graph -> Search Document.

It's worth noting that Virtuoso (produced by my employer, available in free Open Source and paid Commercial variants, http://virtuoso.openlinksw.com/features-comparison-matrix/) is a hybrid Relational/Graph/XML/FreeText storage and query engine, which natively supports SQL, SPARQL, XPath, XQuery, and many other open standards. It might satisfy the OP's needs on its own.

Virtuoso's support for open standards makes it easy to use it as a complete solution covering all the bases, or, as in @philjohn's case, to plug-and-play with best-in-breed solutions along any axis where our implementation proves not to serve your needs for any reason. (We do want to know how and why we don't measure up, so we can improve that aspect!)

I use only object databases. No relational. BTree response times for the indexed objects so life is good.

Some projects I use NoSQL, mongo.

I do use Redis for reactivity and caching (but Redis isn't a database, so)

Congratulations on the progress, and thanks for sharing your work. I do want to see more great ODBs with friendly APIs.

Currently, I'm using a graph database (neo4j) in a project about ontologies.

Interesting 'triple' they use :)

    // Our triple struct, used throughout.
    type Triple struct {
        Sub        string `json:"subject"`
        Pred       string `json:"predicate"`
        Obj        string `json:"object"`
        Provenance string `json:"provenance,omitempty"`
    }

This is weirdly ubiquitous in the RDF store world as well. I eventually gave up talking about 'triple stores' and called them 'RDF stores' instead :-)

It is basically required for SPARQL 1.1. (Named) graph support. Often people call them quad stores.

Seems like provenance ought to just be its own triple describing the relation of another triple to some source. Is this just a shortcut?
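The tradeoff can be sketched with plain tuples (made-up values, just to show the two shapes):

```python
# A quad carries provenance inline, while pure-triple reification spends
# extra triples describing the statement itself.

# Quad form -- roughly what Cayley's Triple struct amounts to:
quad = ('alice', 'knows', 'bob', 'source-A')

# Reified form: mint an id for the statement, then attach provenance to it.
stmt_id = '_:s1'
reified = [
    (stmt_id, 'rdf:subject',   'alice'),
    (stmt_id, 'rdf:predicate', 'knows'),
    (stmt_id, 'rdf:object',    'bob'),
    (stmt_id, 'prov:source',   'source-A'),
]

# Same information either way; the quad is one record instead of four
# triples, which is why it's a popular shortcut.
prov = dict((p, o) for (_, p, o) in reified)['prov:source']
print(prov)  # source-A
```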

I can't get over the logo being a 3-colored version of a 2-colorable graph.

That's intentional. It represents robustness and scalability: you can add an edge between a red and yellow node, without having to re-color.

(actually I've just made that up)

It would probably look better without the grey nodes on the ends anyway.

Even Arthur Cayley might have smirked at that one. :)

I've been using a graph db as my primary database in my last project (neo4j) and it has been a pleasure. I wish a good graph database were hooked in with the new Google Cloud stuff so it could be queried/visualized/performance-analyzed in a similar fashion to the BigTable demo earlier.

I'm curious as to how you're dealing with the licensing costs, as they shut down most hopes I had. Granted, Neo4j said they had a "startup" license but due to family issues I never followed up. Still seems a bit extreme to me for the price, but I also (at the time) couldn't really compare it to much else.

And on the topic of Cayley... I'm stoked. I'll give it a try tonight. "Easy to use" and "graph databases" haven't really gone hand in hand for me.

> I'm curious as to how you're dealing with the licensing costs, as they shut down most hopes I had. Granted, Neo4j said they had a "startup" license but due to family issues I never followed up. Still seems a bit extreme to me for the price, but I also (at the time) couldn't really compare it to much else.

Did you want to get support from the company that makes neo4j? If not, then you don't need a license.

Not entirely true. The only free version is the "Community Edition", which runs on a single node; no clustering support, no monitoring, no hot backups, no caching. [1]

It's pretty much useless in a server environment since even replication isn't possible; if you want a redundant setup -- which you will -- you will have to keep the nodes synchronized yourself. Not to mention that since Neo4j is an in-memory database, it puts a hard limit on your dataset size.

They have a "Personal License" [2], but it lasts for one year and you're not allowed to use it if you have capital funding or a certain amount of revenue.

[1] http://neo4j.com/subscriptions/

[2] http://www.neotechnology.com/terms/personal_us/

Actually much of this is not true:

* All Neo4j versions have caching

* Neo4j Enterprise is available for free for any AGPL project, for personal use, and for early startups

* Neo4j is not an in-memory database; it is a persistent, fully transactional database that uses the available memory for caching the hot dataset

> All Neo4j versions have caching

I was actually referring to the "High-Performance Cache" mentioned in the feature matrix.

> Neo4j Enterprise is available for free for any AGPL project

I was really talking about a commercial setting. How many companies deploy a fully open-source project (i.e., honouring the requirements of the AGPL) in a redundant data center? Not a lot, I imagine.

> Neo4j is not an in-memory database

True, it seems I was misinformed about that.

I guess I worried licensing might be an issue with large data sets: http://neo4j.com/subscriptions/

Where using more than one instance would be desirable (I'm not trying to over-optimize-- I swear! ;-_-). I suppose one server can go pretty f#€£ing far. But all the clustering requires a license (~$12k per server for startups?) unless the code which uses it is GPL or AGPL (I think?).

Anyone who wants to explain the AGPL in layman's terms would be greatly appreciated. Or especially Neo4j's application of it. To me, it seems cost-prohibitive for lean startups of one or two people.

Again, they've said contact 'em-- so it may be case by case. I could just be worrying about non-existent problems.

I tried Titan and Orient; I didn't find either as nice as Neo4j to use in my code. Though Neo4j, for me, was much more difficult to set up and use as a cluster (at the time).

... And more on topic, I'm about to install Cayley!

neo4j uses the GPL (specifically version 3), not the Affero GPL. You can build a service based on the community edition without sharing the source of your service; you only need to distribute source (under a GPLv3-compatible license) if you distribute your service to others.

(Summary of the Affero GPL: you must distribute the full source of your service, under an AGPL-compatible FOSS license, to the users of your service. Summary of the GPL: you must distribute the full source of your program, under a GPL-compatible license, to anyone you distribute your program to.)

It does look like the "Community" edition leaves out the clustering features, so if you need those you'd likely need a license for the proprietary version.

Neo4j is an open source graph database. The community and enterprise editions can be used in production without a license (including HA clustering). Licensing is required when you choose not to open source your code that uses Neo4j. When you have a commercial application of Neo4j, the price for the license is well justified so your customers are not impacted when an issue arises in your implementation. Most resolutions are a matter of creating an unmanaged extension that runs code embedded in the JVM.

If it fails in interesting ways, let me know, file a bug, etc. I'm more than happy to help.

Are you using it for production purposes or for your small projects?

If it's for production, how is your read/write performance?

Neo4j wants everything in memory, so the bottlenecks would come after your data-set size outstrips the memory available.

Fast disks are also important for write performance since Neo4j syncs every change - big RAID arrays of SSDs help.

This is one place where Log-Structured-Merge systems can really prove themselves. There is rarely a need to sync on every change as most of them are independent. You can usually gain lots of write-throughput by syncing at the speed the hardware is optimized for and squeezing as many append-only changes in to those logs. At Orly we've spent quite some time looking at ways to deal with these bottle-necks.
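A toy illustration of the batching idea (not Orly's actual implementation):

```python
import os, tempfile

# Buffer append-only records and fsync once per batch instead of once per
# write, trading a little durability latency for much higher throughput.
class BatchedLog:
    def __init__(self, path, batch_size=100):
        self.f = open(path, 'ab')
        self.batch_size = batch_size
        self.pending = 0

    def append(self, record: bytes):
        self.f.write(record + b'\n')   # buffered append, no sync yet
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self):
        self.f.flush()                 # one flush+fsync covers the batch
        os.fsync(self.f.fileno())
        self.pending = 0

path = os.path.join(tempfile.mkdtemp(), 'wal.log')
log = BatchedLog(path, batch_size=3)
for i in range(7):
    log.append(b'change-%d' % i)
log.sync()                             # flush the final partial batch
n_lines = sum(1 for _ in open(path, 'rb'))
print(n_lines)  # 7
```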


if you plan to grow with it, i strongly advise you to battle-test it...

One question: I'm curious what your motivation was for providing a Gremlin-style query language, as opposed to something like Cypher? Was it a case of expressiveness, personal preference, etc.?

I wanted something easy to pick up -- JavaScript seemed natural. I also wanted to be as agnostic as possible, because graph query languages are interesting and worth experimenting with. That's why it has submodules per-language and it's (relatively) straightforward to write another.

Cypher's not bad either, but it's not "just Javascript". But I'm totally taking commits if someone wants to port it :)

I'm a developer on Orly and I focus on the language portion of the project. Orly takes a different approach in this regard by providing a general-purpose functional language as its "query" language, rather than providing a language specifically designed to traverse/query the graph. This approach allows us to perform not only queries but also arbitrary computations on the server side and take advantage of the powerful resources available on the server, such as large memory, CPUs, and, perhaps most interestingly, GPUs.

Getting performance out of a graph often depends on being able to express your query effectively. The last thing you want to do is plunk down a cursor on a node and have your client start wandering around the graph, following edges. All that back-and-forth chattiness is a non-starter network-wise and it gives the database engine essentially no chance to optimize the query. I've been favoring scripts that let me use free variables--like Prolog logic variables--to describe shapes in a graph and then let the database server find bindings that match. Like 'print a.name, b.name for a, b, x, y where a is_friend_of b and a is_friend_of x and b is_friend_of y and not x is_friend_of y and x.city == y.city'
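A brute-force sketch of that binding-based style, with made-up toy data (a real engine would plan and index this; the point is the declarative shape):

```python
from itertools import product

# Describe the shape with free variables; let the engine find bindings.
friends = {('ann', 'bob'), ('ann', 'cat'), ('bob', 'dan')}
city = {'ann': 'NYC', 'bob': 'NYC', 'cat': 'SF', 'dan': 'SF'}

def is_friend(p, q):
    return (p, q) in friends or (q, p) in friends

people = sorted(city)
# a is_friend_of b, a is_friend_of x, b is_friend_of y,
# not x is_friend_of y, x.city == y.city
matches = [
    (a, b) for a, b, x, y in product(people, repeat=4)
    if len({a, b, x, y}) == 4
    and is_friend(a, b) and is_friend(a, x) and is_friend(b, y)
    and not is_friend(x, y) and city[x] == city[y]
]
print(matches)  # [('ann', 'bob'), ('bob', 'ann')]
```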

Almost all graph databases use Gremlin as the querying language, but I really love OrientDB's approach of using SQL[1] for querying graphs. It feels more natural in my opinion, plus it lowers the barrier to entry for people who already know SQL.

Some of the things I like about cayley

1. Switchable backends (I wonder if I can configure it to use CouchDB as a store)

2. Documentation gets right to the point. When I first tried my hand at graph databases I couldn't understand where to start, but Cayley's approach is pretty straightforward, and it wins plus points from me for including a big dataset :)

A question: I see no mention in the docs about running it on multiple nodes (where does it stand with regard to CAP, etc.?)


Has anybody tried out Orly? https://github.com/orlyatomics/orly

I have. I'm on the team that's been developing it for the past four years. It's nice to see graph databases getting some popular traction at last. (I was into them before they were cool, of course.) I've only just started looking at Cayley but it looks like there are some significant differences between the projects. Orly is designed for high-speed, high-volume applications that need large-scale storage and consistent transactions. We're more OLTP than OLAP, which seems to be the way Cayley leans. Most graph systems tend toward analytic applications.

Could Cayley use Orly for a storage engine?

Maybe so. Orly has a somewhat leveldb-like component called a Repo which might slot in well. A Repo is a log-structured merge storage system that also provides indexing, access to previous values, and, most importantly, consistent reads and writes, even across multiple Repos. It also makes good use of resources (RAM, SSD, HDD) to optimize performance and cost. (Also working on letting it run on GPUs, for insanely high performance for people with the power and air conditioning budgets.)

I've seen a couple of mentions of graph dbs on GPUs in this thread, and it does seem like a (somewhat) obvious fit. Anyone aware of any projects that make (good) use of that right now? Not necessarily a stand-alone service, but also things like an embedded graph db ("Berkeley DB for graphs") or something like that?

BigData by Systap is working on this combination; see this blog post (http://blog.bigdata.com/?p=658).

We would clearly need to do a deep dive but since Orly was designed with modules in mind it should be feasible to use the storage engine underneath Cayley.

Awesome - all my work is with graph data and graph databases, so any additions to the space are great news.

Can anyone provide a working example for the visualization feature? The docs say that you can use the Tag functionality to label source/target nodes for sigma.js rendering, but bridging the gap between that suggestion and the actual query does not seem trivial.

Sad to see no SPARQL support as of yet, it looks like it's on their longer term goals though as they are query language agnostic.

Interesting to see more products entering this space.

Has anyone tried out Tinkerpop?


I played with it, and it was kinda fun.

Demo version somewhere?

http://cayley-graph.appspot.com is a live demo on App Engine.


> Not a Google project

  Not a Google project, but created and maintained by a
  Googler, with permission from and assignment to Google,
  under the Apache License, version 2.0.

and obviously, not "google's new graph database".

It looks like a Googler's 20% project.

However, I'll point out that most of the actually-used open-source projects coming out of Google aren't actually "Google projects", they are "projects by Googlers released under the OSPO process". LevelDB (Jeff Dean & Sanjay Ghemawat), Protocol Buffers (Kenton Varda), Guice (Bob Lee, Jesse Wilson, and Kevin Bourillion), Gumbo (myself), and angular.js (a team within DoubleClick) all started out as small internal projects built to scratch an individual's itch that were then released externally because hey, why not. The "corporate" open-source projects have been things like Android, Chrome, GWT, Closure, Polymer, etc.

Exactly this. And I'm the Googler. Hi!

Hi. Former googler who does work that Graph Databases would be useful for. If you don't mind I'd like to ask:

Does it have bulk import, and if so, what is its speed for bulk import, roughly speaking?

It does have bulk import; I've been loading largish subsets of the Freebase dumps.

Load speed is pretty good (into persistent storage, I assume) and can be improved with some of the database parameters. A rough estimate is that a million triples or so takes about 5 minutes, but that slows down as it gets bigger. 134m triples took me 6-8hrs, so I slept on it.

Since no one's mentioned it yet, another alternative to Neo4J is AllegroGraph (though you need to pay for it; the free version supports 5 million tuples).

It _does_ support bulk loading, with over 500K triples per second. According to http://franz.com/agraph/allegrograph/agraph_benchmarks.lhtml, given enough RAM, it can load over a billion tuples in just over half an hour.

Virtuoso 7.1 and OWLIM 5.5 have similar loading speeds. In this case the decompression algorithm is often the bottleneck, requiring multiple files to be read in parallel to go faster. Oracle 12c Semantic Network and YarcData uRiKA can also load faster if correctly set up.

What kind of usage is this seeing inside Google?


We changed the title to the one on the page.


Woo, this is probably the coolest project from Google in the last while. Is there anybody using a graph database here? What is your use case? Would you mind sharing?

At my company, we're using graph-based models for most of our NLP stuff (mainly entity extraction ATM) and our ANNs (mainly object classifiers).

- Jonathan

Thanks Jonathan. I am not terribly familiar with NLP would you mind resolving the 3 letter acronyms (ATM, ANN)?

  OS X:  Homebrew is the preferred method.

Over my dead body. Homebrew messes up my /usr/local, leaving its heaps of crap everywhere. I don't find that acceptable. MacPorts at least has the decency to put itself into /opt/local, which isn't normally used by anything else.
