Graph query languages: Cypher vs. Gremlin vs. nGQL (nebula-graph.io)
128 points by jamie-vesoft on March 4, 2020 | 64 comments



I've used Cypher most days over the last few years, and I like it because it's a complete level of abstraction over the database.

In Gremlin and this new nGQL, the whole notion of "INSERT" implies that the graph is just a representation on top of a relational DB, and that power users should understand RDBMS concepts articulated in SQL queries first, with this novel and cute graph thingy tacked on afterward for managers. It's like asking Excel users to understand and care about pointer arithmetic.

Gremlin leverages some people's mental sunk costs in SQL, whereas Cypher, verbose as it is, lets me reason purely about my graph model without hacking over the implementation. The others aren't bad, but the people who will be using graphs won't be DBAs.

In this sense, Cypher/Neo4j isn't a competitor to NoSQL and RDBMS products; it's a competitor to spreadsheets, where the majority of people actually get work done themselves instead of specifying it and having others engineer it.
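
To make the contrast concrete: creating the same tiny graph in Cypher vs. nGQL looks roughly like this (a sketch; exact nGQL syntax depends on the Nebula version and your schema):

  // Cypher: one graph-shaped statement, no RDBMS concepts leaking through
  CREATE (a:Person {name: 'Alice'})-[:KNOWS]->(b:Person {name: 'Bob'})

  # nGQL: SQL-flavored INSERTs against a predefined schema
  INSERT VERTEX person(name) VALUES "alice":("Alice");
  INSERT VERTEX person(name) VALUES "bob":("Bob");
  INSERT EDGE knows() VALUES "alice"->"bob":();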


I was really attracted to graph databases mainly for the ability to do joins in effectively constant time rather than O(log N) time.

But then I realized that sharding and localizing data can accomplish roughly the same thing.

Also the graph database doesn’t have to duplicate data so much for joins, saving on memory.

If you are going to have a huge dataset, build your data in an RDBMS first and then make a cache in a graph database.

I say this only because graph databases are not mainstream yet.


I came to exactly the same conclusion after having Neo4j pushed on a project by managers who'd been sold by their "it's great for everything!" marketing. At least as of ~2 years ago: no, it wasn't. Fine for a narrow set of query types on data of a very specific shape (dense graph) that you don't care about much and can re-generate if it gets screwed up. Unsuitable as a "database of record" (poor integrity enforcement, very limited transactions) and quite bad (slow, awkward) at a lot of things one might want to do with it, even kinda "graphy" things, that didn't happen to fall within its strengths. Memory hog, too. Seems like most graph databases, though possibly better at some of those things, will tend to be similar, since they have to make trade-offs to achieve notably good performance at whatever they're benchmarking for their marketing pages.

Cypher was really nice, though.


Spot on with my experiences. I had a great honeymoon period with Cypher via Neo4j but then the cracks quickly started to show.

It was a valuable learning experience but I find myself moving back toward SQLy and JSONy things for production models.


What version did you use? What you are describing sounds like Neo4j from around 6 years ago, not today.


About two years ago. Whatever was current then. We had a paid license too, because this client was all-in on N4J. They had their own internal champion who'd set himself up as the "Neo4j expert" and got them to send him to conferences, hang out on phone calls, and try to fix the fires he was partially responsible for but didn't get blamed for. All their projects had serious issues resulting from their insistence on a particular stack, which had basically nothing capable of protecting data consistency at any point. We had to fight for TypeScript (vs. their preferred vanilla JavaScript) to gain a tiny semblance of sanity, productivity, and stability in that environment.

Transactions in N4j couldn't handle modifying one entity [edit: the schema of one entity type, I mean] while updating another, which was pretty limiting (say, you want to do safe Rails-migrations-style version bumps in the DB in the same transaction as your modifications, so they can't get out of sync). Constraint capabilities and data types were very limited. Performance, if you stepped off the Golden Path (easy to do by accident with something that looked boring and normal), was mediocre at best for our use case (smallish sparse sub-graph fetching, mostly).

[EDIT] meanwhile, all the official material from N4J was doing its MongoDBest to sell itself as 100% suitable for production for 100% of use cases, because of course it was. Look anywhere else and you got a very different, more accurate story.


I posted a thread recently about my struggle to evaluate neo against multi model alternatives but have not gotten any useful replies. Any chance I could pick your brain? Email in my profile.


No mention of Datalog? http://www.learndatalogtoday.org/


Datalog is such a delight to use, especially since queries are just data structures are just code. Once the basics clicked, I felt empowered to do anything in Datalog, while I feel like I always have to learn or remind myself of more syntax whenever I want to do anything fancy in SQL.
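
For example, a Datomic-style query is literally just an EDN data structure (hypothetical schema attributes):

  ;; plain data: build or transform it like any other Clojure collection,
  ;; then hand it to the query engine, e.g. (d/q friends-of-friends db "Alice")
  (def friends-of-friends
    '[:find ?fof-name
      :in $ ?name
      :where
      [?p :person/name ?name]
      [?p :person/friend ?f]
      [?f :person/friend ?fof]
      [?fof :person/name ?fof-name]])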


FWIW there's a Prolog-based graph DB called Terminus: https://medium.com/terminusdb https://github.com/terminusdb

I don't know much about it but they are associated with Seshat Global History Databank http://seshatdatabank.info/ which, if I understand correctly, is something like a serious attempt at "psychohistory" (like in Asimov's Foundation.) https://medium.com/terminusdb/stuffing-the-whole-human-histo...


Thanks for TerminusDB. I'm currently implementing a toy datalog-based graph DB in Clojure backed by RocksDB.

(I wish Terminus didn't publish articles on Medium, where you need an account to read)


Sorry, but this article fails big time: no mention at all of SPARQL.

For application developers, having access to general public Knowledge Graphs like DBPedia and WikiData can be a very good resource.

While I am also a big fan of more general graph databases like Neo4J, not even mentioning SPARQL is such a HUGE OMISSION that I have to suspect some commercially motivated bias in this article.

The decision of which graph data platform to use is not always black and white. Use SPARQL with an RDF/OWL data store, or a more general graph data store like Neo4j, as appropriate. Learn both technologies.


I assume it's not mentioned because it's not one of the Query Languages mentioned as part of the standardisation project.

  'GQL is an upcoming International Standard language for property graph
  querying that is currently being created. The idea of a standalone graph
  query language to complement SQL was raised by ISO SC32/WG3 members in
  early 2017, and is echoed in the GQL manifesto of May 2018.

  GQL supporters aim to develop a rock-solid next-generation declarative
  graph query language that builds on the foundations of SQL and integrates
  proven ideas from the existing openCypher, PGQL, GSQL, and G-CORE
  languages. The proposed SQL:2020 Property Graph Query Extensions already
  build on these existing languages.'
https://www.gqlstandards.org/home


I agree that SPARQL must be considered, and so must Datalog and Prolog. I don't know, but I'm starting to believe these new-fangled standardization efforts (if you want to call them that) are starting over from scratch and actively avoiding prior art; maybe it's a generational thing. I mean, I don't know the people behind GQL, but I can see younger devs thinking there's absolutely no reason the 1985-2010 crowd should have had all the fun.

Though I can see why someone wouldn't like SPARQL and RDF, with their bulk reuse of other W3C and TBL concepts such as URLs, resulting in atom and predicate names verbosely and pointlessly beginning with "http://".


How are DBPedia & WikiData useful for application developers? As a reference or for actual integration into applications?


I use WikiData as a (non-realtime) data source for Swymm.org. I've written a bunch of pretty intense SPARQL queries for it, and I agree that it's odd that SPARQL is not mentioned in the post.


Depending on your domain, they can be super useful, if not indispensable, for Named Entity Recognition.


Is there any SPARQL implementation that returns results in under 10 seconds on a big dataset? Because I never found a public SPARQL endpoint that gives remotely acceptable response times.


While there may be several things missing for many productive use cases (especially inserts/updates), I think QLever (https://github.com/ad-freiburg/QLever) fits that description very well. There's also a public endpoint linked there.


Under 10 seconds on which query? If your query involves big joins over a large distributed dataset, there won't be a technology that can do it for you. SPARQL is not the problem: you can write the same queries in Cypher or any other language, and you will hit the same performance problems.


I'm not sure exactly what you mean by implementation here, but many (most?) of the Wikidata examples (on its public endpoint) are very fast, e.g.: https://query.wikidata.org/#%23Cats%0ASELECT%20%3Fitem%20%3F...
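
For reference, that's the classic "Cats" example; the query behind the link is roughly:

  #Cats
  SELECT ?item ?itemLabel WHERE {
    ?item wdt:P31 wd:Q146 .  # instance of "house cat"
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }

and it comes back near-instantly on the public endpoint.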


Of course there are. But SPARQL endpoints open to the public at no cost to the user, and potentially accessed by who-knows-how-many clients concurrently, can't be used as benchmarks.


Someone already mentioned that public endpoints aren't good benchmarks.

But there are many SPARQL-enabled databases (back when I wrote my Master's thesis in 2014, that even included Oracle), and they are indeed quite performant - though there are details like batch vs. realtime materialization and the like.

In my experience, AllegroGraph was fast enough (apparently someone even uses it now to translate between the HL7 schemas of multiple providers in the USA).


There is also EQL, a query language written in EDN:

https://edn-query-language.org/

Since it uses vectors/maps to describe the query, you can compose/generate/transform queries with simple data operations (concat, filter, assoc...).
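
A quick sketch of that (hypothetical attribute names):

  ;; EQL queries are plain vectors/maps, so composition is just data:
  (def user-query    [:user/id :user/name])
  (def address-query [{:user/address [:address/street :address/city]}])

  (into user-query address-query)
  ;; => [:user/id :user/name {:user/address [:address/street :address/city]}]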


We evaluated umpteen graph dbs this past year and chose vanilla Postgres instead because Neo4j/RedisGraph have insane licenses.

It’s useful that you’re comparing these languages. I would suggest making yours more like Cypher: specifically, the arrows <-[]-> are much less verbose than BIDIRECT/REVERSELY, and MATCH is a lot less verbose and more powerful than what other query languages offer. It sucks to want to use Neo4j and then not be able to due to license and business issues.

If there were a quality graph DB with a permissive license and Cypher, serverless hosting, search, and JSON, we’d use it... Neo4j didn’t work out because they require an NDA to get a price quote (!), and the Redis Source Available License basically reads, “you can’t use this for startups”. RedisLabs.com quotes a “low” price of $500 monthly to get modules with basic stuff like JSON, Search, and Graphs (“cloud pro”) - but then the pricing page triples that number. We pointed this out to RedisLabs through at least 3 different channels (email, git, Twitter), but the pricing error still exists on their cloud page. If RedisLabs leaves an $800/mo typo sitting on their page for months, how do you trust them with sensitive customer data? Went with Amazon Aurora PostgreSQL instead. Love Row Level Security (but wish you could specify columns inside your row policies).

You might also include ArangoDB AQL.


Neo4j is licensed as GPLv3 unless you want the enterprise features (replication). I've run the non-enterprise version in production and it worked fine for a limited workload. Replication would have been nice at some scales, but it wasn't the reads that were the issue anyway; it was the writes, which replication wouldn't help.

RedisGraph is licensed under their weird license but as far as I can tell you just can't expose the RedisGraph API directly to your customers. Building an API on top of it that adds some abstractions should be fine (think a social network powered by the extension). I could be wrong about that though.

Are you actually doing graph work in Postgres? I learned about recursive CTE queries once upon a time and that was what prompted adoption of Neo4j.
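
The kind of thing I mean, sketched against a hypothetical edges(src, dst) table:

  -- reachability out to 5 hops in vanilla Postgres
  WITH RECURSIVE reachable(node, depth) AS (
    SELECT dst, 1 FROM edges WHERE src = 42
    UNION
    SELECT e.dst, r.depth + 1
    FROM edges e JOIN reachable r ON e.src = r.node
    WHERE r.depth < 5
  )
  SELECT DISTINCT node FROM reachable;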

(I'm not a lawyer and don't take legal advice from the internet)


You should take a look at ONgDB https://www.graphfoundation.org/projects/ongdb/

and might find some of these blog posts helpful https://blog.igovsol.com/


Interesting story with Neo4j and RedisGraph. Thanks for your suggestion. We are going to support OpenCypher. We will also take a look at AQL and include it in our next version. :)


I rather like Cypher: easy to get into with the (node)-[edge]->(node) construction, difficult in the middle (until you realise that WITH is very different to SQL's), then a delight. Gremlin - so they let Java's horrible camelCase leak into their syntax? Oh my ...


You write Gremlin in the programming language of your choice, and with that come the idioms of that language. For Java, that means camelCased syntax, as that is what Java developers expect. So a function like hasLabel("person") looks right to them. In C#, that same function is HasLabel("person"), in Clojure it is (has-label :person), and so on. Gremlin isn't meant to be an embedded string within your programming language. It is actual code within your programming language.
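
For instance, the traversal "names of the people marko knows" in Gremlin-Java reads like ordinary fluent Java (a sketch, assuming a traversal source g):

  // plain Java code, not an embedded query string
  List<Object> names = g.V().has("person", "name", "marko")
      .out("knows")
      .values("name")
      .toList();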


Well I stand corrected. I was judging from the examples in the linked article, so will have another look ...


I've been using Cypher (Neo4j) for a bit over a year now, and I love it. We'd need a dozen JOINs to do some of our queries in SQL, but in Cypher it's very natural to write complex relationships.

Overuse of WITH does make it a bit more imperative, but also makes it easy to shape the result in the form you want it in. Cypher has been a joy to work with. Gremlin, by comparison, does not attract me at all.
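
For example (hypothetical schema), a pattern like this would take three or four JOINs in SQL but reads like the whiteboard drawing in Cypher:

  MATCH (u:User {name: 'Alice'})-[:PLACED]->(:Order)
        -[:CONTAINS]->(p:Product)<-[:SUPPLIES]-(s:Supplier)
  RETURN p.name AS product, s.name AS supplier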


Granted, Java was one of the first programming languages I seriously learned (in university no less). The camelCase syntax became somewhat of a second nature to me, and indeed, I even internalized the idea that it simply looked better that way. I understand that syntax preferences and capitalization are mostly just preference, but are there strong arguments against the use of camelCase, or rather, in favor of something different?


If you have any opinions on GraphQL vs Cypher, I’d really appreciate it!

Context: in a web app usage scenario, the tooling available for GraphQL (with React libraries) seems more popular than for Cypher.


Completely different use case and purpose.

Cypher is a graph query language.

GraphQL is an API query language for querying tree-like structures (trees...); it has nothing to do with graphs, actually.


GraphQL can totally be used to represent and query true graphs, not just trees. The data gets materialized into nested JSON (which can be thought of as a tree), but that doesn't mean it doesn't represent a graph.

Other than that though, you are quite correct, the two do not serve the same purpose.

Neo4j has an extension that exposes a GraphQL API that converts the queries to Cypher (https://neo4j.com/developer/graphql/#neo4j-graphql-java). I've used it for some basic things but for anything complex you need to be careful and inspect the queries it produces.
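
To illustrate the shape difference (hypothetical schema): a GraphQL query spells out every level of the tree explicitly, e.g.

  {
    person(name: "Alice") {
      name
      friends {
        name
        friends { name }   # each additional hop must be written out
      }
    }
  }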


How would you query for a path or cycle of arbitrary length in GraphQL?


You cannot do that directly out-of-the box, but you can write your server to include directives for that. See some discussion here: https://github.com/graphql/graphql-spec/issues/91

Also, just because you cannot query at arbitrary depths doesn't mean that GraphQL doesn't work with graphs, just that it might be a poor choice of tool for that work.


Thank you, everyone in this thread! These are some helpful insights.


DGraph is perfectly capable of doing these with a query language extremely similar to GraphQL (and as of v2 will natively support GraphQL):

https://docs.dgraph.io/master/query-language/#k-shortest-pat...
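
From memory, the docs example there looks roughly like:

  {
    path as shortest(from: 0x2, to: 0x5) {
      friend
    }
    path(func: uid(path)) {
      name
    }
  }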


camelCase makes sense in Go's "Go Way", since keeping variable names short within their scope is encouraged, and underscores make such names harder to read. Capitalization is also semantic in Go: CamelCase (capitalized) names are exported by the language.

So it doesn't need to be bad, necessarily, if the code follows the same conventions.
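
Concretely, a minimal sketch:

  package names

  var retryCount int // unexported: visible only inside this package
  var MaxRetries int // exported: the capital letter is the semantics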


I may be missing something, but do any of them let you return information that is not a node in the underlying graph? I had a project once where a user request would produce a derived graph: result nodes and edges computed from the database's nodes and edges. I rolled my own system because nothing supported that.

Related: do any of them let you do something simple like return a count of vertices obtained from a traversal, or do you need to walk the result and count them yourself?
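
In Cypher terms (if I have the syntax right), I mean something like:

  MATCH (a:Account {id: 1})-[:LINKED*1..4]->(n)
  RETURN count(DISTINCT n)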


A new "graph query language" that doesn't even mention Prolog is like describing "a new way to create illness immunity by injecting the body with a weakened version of the virus that causes the illness" or "a new computer architecture wherein both instructions and data reside in random-access memory."


Is anybody still using SPARQL?


Yes - I work on several projects that leverage SPARQL. The article is also remiss in not mentioning the alignment work the W3C is doing along with Neo4j (https://www.w3.org/Data/events/data-ws-2019/). Indeed, this article seems very self-serving in its omissions. There is a follow-up meeting planned soon for that, too.

Also, much of the work on graph validation (SHACL, ShEx) leverages SPARQL, so it's not going anywhere soon. I would like to see it evolve to allow more vertex-based searches without the need for extensions, though.


From your link "W3C's RDF uses URIs (Web addresses) for nodes and link labels in directed graphs. This has the advantage of enabling them to be dereferenced to obtain further information, making for a Web of linked data. In particular, nodes can be dereferenced to graphs on remote databases."

I think spreading that kind of lie[1] is part of why there is a divide between the W3C (at least the RDF community) and the rest of the world. The W3C should be more transparent about RDF's capabilities and realistic about the real power of the Semantic Web.

[1] This is false because RDF uses IRIs, not URLs, and not all IRIs are dereferenceable. Moreover, even URLs used as resource IDs are not constrained to be dereferenceable per the spec (and when they are, you'll get a lot of 404s in practice). Also, the same effect can be obtained just as easily with properties (key-value pairs) attached to the nodes of a graph database.


I think the reasoning behind RDF using URIs/IRIs for nodes and predicates is that this gives a globally unique naming hierarchy (backed in turn by DNS), so that there are no name clashes when combining heterogeneous data, while still allowing liberal use in closed DBs via "urn:" URIs. But yeah, if the linked article insinuates that dereferencing RDF URLs is useful or even common, that would be false IMO.
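
A sketch in Turtle: both of these identifiers are just opaque global names to RDF, dereferenceable or not.

  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  <http://example.org/person/alice> foaf:knows <urn:example:bob> .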


Yes, thank you! That's one of my biggest gripes with the W3C stack. How am I supposed to build semantic data, if I can't access the semantics and/or they can change or disappear at any time?

That has been one of the main motivations behind my work on a content-addressed semantic data/ontology format[0] with a modern decentralized stack in mind.

[0] https://github.com/rlay-project/rlay-ontology


Yeah, the SPARQL people need to realize that the semantic web was a massive dud in the programming and database market, and a lot of that was overreach, overpromise, and a lack of focus on "real-world" problems.

Thus, if someone is looking to unify graph QLs that are in actual use on "business" problems, SPARQL and RDF overall aren't going to get attention. You can start with the fact that RDF basically assumes you want to globally address all your data with URIs, which results in ridiculously verbose overhead in naming/addressing. Never mind that such things basically promise a kind of long-term durability the actual web has shown doesn't exist. After all, a URI like www.tla.com/link/to/some/data can mean the World Wide Wrestling Federation one day and the World Wildlife Fund the next.

In particular, Gremlin was adopted by DSE / Titan / Titan's successor, which ran atop Cassandra for near-limitless scalability.

RDF and the Semantic Web, while intended for the massive WWW, seemed not to care about demonstrating techniques, queries, and architectures at scale.

Likewise, are Datalog and Prolog used extensively?


Both Prolog and SPARQL/RDF/Semantic Web are used at scale and pretty extensively. Unfortunately often behind closed doors, but there are very performant systems involved.

URIs and RDF in general don't need to use public HTTP links or anything like that; meanwhile, the layered systems like OWL and RDFS provide some impressive features for implementing complex systems, especially when you actually want to use a semantic graph instead of the loosely-schemed bag of nodes and edges common in non-RDF graph databases.


So SPARQL/RDF should stick to its limited application space and let more generally applicable graph technologies go their own way.


Except it's actually a superset of those simplistic models.

It's just not getting the "kool" looks and gushing reviews, mostly written by people who totally missed all the previous research. Which is a common problem all around in computing.


Datalog is used fairly extensively in the Clojure world, in Datomic and a handful of other implementations.


Yes, it is still very much alive, but it doesn't benefit from marketing money, so it flies under the radar. A new version, SPARQL 1.2, is currently being worked on by a W3C working group to address some of the limitations of the current version.


I am using SPARQL on a commercial product I am developing and also for a future product (I am a solo developer, so one product at a time).

As I mentioned in another comment, the SPARQL endpoints for WikiData and DBPedia are fantastic resources, for some projects.


Yes, it's quite big in bioinformatics, pharma, chem, medicine.


Have any of you checked out TigerGraph yet? I've heard that if you've run into scale issues with any other graph DB, you should check them out. At first glance, their benchmarks blow Neo4j out of the water... GSQL seems intuitive - anybody have any experience?


They forgot to mention that Cypher is also supported by AgensGraph.


Hey, thanks for pointing that out. We'll add that soon.


Unrelated to the content, but can you please not put white on yellow like that? It's basically impossible for me to read with cataracts.


You mean the pictures, right? Sorry for the bad experience. We will definitely improve our color palette for future posts.


Thank you :)


Huge fan of cypher. Used it for a previous start up, including the data store for an OAuth2 implementation.


Good to know. Cypher is quite popular indeed.



