For example: (:Person)-[:BEFRIENDS]->(:Person)
If we wanted to store a date with that relationship, Neo4j has got your back; that's entirely possible (relationships can have attributes). But now our requirements change and we also want an entity for Events shared by friends (e.g. a friendship anniversary), so we have to remodel our data to something like:
In SQL that wouldn't have been a remodel, because there's no difference between a relationship and an entity. We would've gone from:
So I feel like the vertex/edge distinction Neo4j makes gets in the way of changing data model needs, and I ultimately think that modeling your data as a graph is not helpful. It can be extremely helpful in querying, though, and that's where its biggest strength lies.
This is the essential difference between neo4j (a property graph engine) and most RDF stores which support quads (i.e. <subject> <predicate> <object> <context>)
If A and B share an event via a friendship, you can keep track of things about that. Granted in an RDBMS if all you wanted was to draw the line then you could do it with an extra FK, but I think the conclusion you're drawing is going too far, specifically:
> In SQL that wouldn't have been a remodel, because there's no difference between a relationship and an entity
There is a difference in SQL; relationships are EITHER more columns, OR a join to another table, and both are possible. In graphs, "hyper relationships" (e.g. relating more than 2 things) require another node, but this is an apples/oranges comparison.
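To make the "requires another node" point concrete, here is a rough Python sketch of that remodel. It uses plain dicts and tuples, not Neo4j's API, and every name in it (`reify_friendship`, `PARTICIPATES_IN`, `HAS_EVENT`, the dates) is made up for illustration:

```python
# Hypothetical in-memory property graph: nodes keyed by id, edges as tuples
# of (source, relationship type, target, properties).
nodes = {
    "alice": {"label": "Person"},
    "bob": {"label": "Person"},
}

# Before: the friendship is an edge carrying a property.
edges_before = [
    ("alice", "BEFRIENDS", "bob", {"since": "2015-06-01"}),
]


def reify_friendship(nodes, edges):
    """Turn each BEFRIENDS edge into a Friendship node that other things can point at."""
    new_nodes = dict(nodes)
    new_edges = []
    for i, (src, rel, dst, props) in enumerate(edges):
        if rel != "BEFRIENDS":
            new_edges.append((src, rel, dst, props))
            continue
        fid = f"friendship{i}"
        # The edge's properties move onto the new node.
        new_nodes[fid] = {"label": "Friendship", **props}
        new_edges.append((src, "PARTICIPATES_IN", fid, {}))
        new_edges.append((dst, "PARTICIPATES_IN", fid, {}))
    return new_nodes, new_edges


nodes_after, edges_after = reify_friendship(nodes, edges_before)

# Now an Event (e.g. an anniversary) can relate to the friendship itself.
nodes_after["anniversary1"] = {"label": "Event", "date": "2016-06-01"}
edges_after.append(("friendship0", "HAS_EVENT", "anniversary1", {}))
```

The data survives the change, but every query that used to follow a single BEFRIENDS edge now has to hop through the Friendship node, which is the remodel being complained about.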
I hope something like this becomes as mature and usable as Neo4j.
A month later we rewrote everything in SQL - the main drivers were:
- as we refined our model, we realized that a relational DB with a bunch of join tables was good enough
- our developers were more comfortable working with SQL
- it wasn't possible to run complicated queries involving both databases simultaneously
- the Rails ORM felt easier to use than the Neo4j Ruby APIs (though this was certainly a function of our own familiarity with Rails and relational databases in general)
- having the extra database complicated our codebase and complicated our deployment
There was nothing horrifying or surprising in our encounter with graph databases. It just felt like we made the wrong initial architectural decision. We were still trying to define the problem and were trying to use something we didn't fully understand.
I'd hesitate to use graph dbs in the future unless I needed a high-performance app with a lot of data that only a graph could model well. Otherwise having two different types of databases is annoying.
While a lot of the work I do is covered by NDA, one problem I've applied it to that I can talk about is analyzing basketball play-by-plays. I've spent some time talking to the analytics team at an NBA franchise, and it turns out doing interesting analytics on play-by-plays can be a surprisingly tough nut to crack. RDF was a great tool for tackling this. Here's the source (written in Scala), for anyone interested in having a look: https://github.com/andrewstellman/pbprdf
Sports events seem a good example. What made it easier for you in your example of basketball play-by-plays with RDF?
Taking the first example https://github.com/andrewstellman/pbprdf#example-analyze-a-s... and translating it into an RDBMS approach seems rather straightforward:
GameEvent (PersonA, EventType, PersonB, Game, Time)
Roster (Person, Game, Team)
SELECT Team, COUNT(Team)
FROM GameEvent
JOIN Roster ON GameEvent.Game = Roster.Game AND GameEvent.PersonB = Roster.Person
WHERE GameEvent.EventType = 'foul'
GROUP BY Roster.Team
Period (Number, Game, StartTime, EndTime)
.. WHERE Period.EndTime - GameEvent.Time < 5 ..
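For anyone who wants to poke at the relational version, here is a runnable sketch of the fouls-per-team query using Python's built-in `sqlite3`. The table and column names follow the schema above; the rows (players `p1` to `p3`, game `g1`, teams `Home` and `Away`) are invented, and counting fouls by `PersonB`'s team is just my reading of the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GameEvent (PersonA TEXT, EventType TEXT, PersonB TEXT, Game TEXT, Time INTEGER);
CREATE TABLE Roster (Person TEXT, Game TEXT, Team TEXT);
""")

# Made-up roster and events for a single game.
conn.executemany("INSERT INTO Roster VALUES (?, ?, ?)", [
    ("p1", "g1", "Home"), ("p2", "g1", "Home"), ("p3", "g1", "Away"),
])
conn.executemany("INSERT INTO GameEvent VALUES (?, ?, ?, ?, ?)", [
    ("p3", "foul", "p1", "g1", 100),
    ("p3", "foul", "p2", "g1", 200),
    ("p1", "foul", "p3", "g1", 300),
    ("p1", "shot", "p3", "g1", 400),  # not a foul, must be filtered out
])

# Count foul events grouped by the team of PersonB.
fouls_by_team = dict(conn.execute("""
    SELECT Team, COUNT(Team)
    FROM GameEvent
    JOIN Roster ON GameEvent.Game = Roster.Game AND GameEvent.PersonB = Roster.Person
    WHERE GameEvent.EventType = 'foul'
    GROUP BY Roster.Team
""").fetchall())
```

With these rows, `fouls_by_team` comes out as `{"Home": 2, "Away": 1}`.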
That's definitely an important benefit. RDF makes it really easy to introduce changes that not only don't break the existing schema, but can be entirely isolated or combined in queries.
Another thing that RDF makes easy is analysis that takes advantage of a graph -- using relationships with other players, shots, etc. What players have the highest percentage making 3-point shots in possessions immediately after a player on the other team missed a 3-point shot? Building queries that compare previous possessions, shots, quarters; players' relations to each other (e.g. performance of players who were subbed in after previous teammates went scoreless for 3 possessions) -- these things are a lot easier to do in RDF than in SQL.
Obviously, there are many things that are easier to model in RDBMS and query in SQL than with RDF/SPARQL. Every tool has its uses.
A few years ago I put together a quick GUI in C# to make it easier to run SPARQL queries: https://github.com/andrewstellman/sparql-explorer
I haven't found an RDF editor or visual tool that I like. Some people like Topbraid Composer: https://www.topquadrant.com/tools/modeling-topbraid-composer... (commercial, closed source)
PS: I am the maintainer, and we currently have some certificate issues. Drop me a mail to firstname.lastname@example.org if you want to be notified when the problem is solved.
As important as RDF has been for the web, it feels increasingly less "web native" today, as most efforts still seem to be highly Java-focused and browsers mostly don't run Java anymore.
(I ask because a silly project idea I have some tiny amount of notes for is something of a Twine competitor. I realized that while the language ideas I'm exploring don't look like SPARQL or other graph languages, there is a bit of an overlap conceptually under the hood and an RDF store might make sense as a bootstrap tech, but brief searches didn't turn up anything useful.)
Note: I have never understood the need for a DB on the client-side.
In another thread GUN is mentioned, and that may be the closest client-side DB to what I think I'm looking for, if I ever get around to that side project.
That said, we've done some work to prevent runaway queries (e.g. strict query timeouts, downstream systems that handle that situation gracefully).
I know that a decent RDBMS (simplified) will consist of the following:
- data in blocks organised with a block-size that the underlying filesystem likes
- a cache for the most frequently used blocks
- every index is a B-Tree with pointers to the blocks containing the tuples
Then there are column stores as well as row stores, and for compression you might have some dictionary encoding going on.
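As a toy picture of the pieces in that list, here is a small Python sketch: fixed-size blocks, an LRU cache over block reads, and an index mapping keys to block numbers. Everything in it is an assumption for illustration. A sorted list with `bisect` stands in for the B-Tree, a Python list stands in for the disk, and `BlockCache` and `lookup` are made-up names, not any real engine's internals:

```python
import bisect
from collections import OrderedDict

BLOCK_SIZE = 4  # tuples per block; real systems size blocks in bytes

# "Disk": a list of blocks, each holding a few (key, value) tuples.
rows = [(k, f"row-{k}") for k in range(10)]
disk = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

# Index: sorted keys plus, for each key, the number of the block holding it.
index_keys = [k for k, _ in rows]
index_blocks = [i // BLOCK_SIZE for i in range(len(rows))]


class BlockCache:
    """Tiny LRU cache over block reads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def get(self, block_no):
        if block_no in self.blocks:
            self.blocks.move_to_end(block_no)  # mark as most recently used
            return self.blocks[block_no]
        self.disk_reads += 1
        block = disk[block_no]
        self.blocks[block_no] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return block


def lookup(cache, key):
    """Find the key in the index, then fetch its block through the cache."""
    pos = bisect.bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None
    block = cache.get(index_blocks[pos])
    return next(v for k, v in block if k == key)


cache = BlockCache(capacity=2)
first = lookup(cache, 5)
second = lookup(cache, 6)  # same block as key 5, so no second disk read
```

The two lookups above cost a single block read, which is the locality that a row store gets for neighbouring keys.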
Now, how does the Graph Database look under the hood and what are the complexities involved? How is the Graph persisted?
[Neo4j internals can be seen here](https://www.slideshare.net/thobe/an-overview-of-neo4j-intern...)...it's a bit old but I think mostly still accurate.
In graphs you have to persist nodes and edges, though you may partition nodes by label/category. In the case of neo4j there is a property store rather than a set of columns.
In the graph databases book (graphdatabases.com) there is a chapter on the internal architecture of Neo4j.
Consider the following case:
Client A issues a query -- starting from a vertex, conduct bounded closure search, giving every visited vertex a mark (coloring, or lexical flag, whatever you would expect from a graph algorithm)
Client B issues a query -- clearing any marks applied to a particular vertex, which happens to be one of the visited vertex of Client A's query.
Now, race condition aside, let's assume we first process query A then B. Would we allow query B to succeed? It is clearly possible for query B to break the semantic of query A, for example, query A goes through a bridge and then query B cuts the bridge, so the connectivity information is lost.
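A minimal Python sketch of that scenario, run strictly in sequence so there is no race at all. The four-vertex chain, the shared `marks` set and the function names are all assumptions made up for the example:

```python
from collections import deque

# A simple chain a -> b -> c -> d; "b" is the bridge between "a" and "c".
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
marks = set()


def bounded_mark(start, max_depth):
    """Query A: BFS from start up to max_depth hops, marking every visited vertex."""
    queue = deque([(start, 0)])
    marks.add(start)
    while queue:
        vertex, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the bound
        for nxt in graph[vertex]:
            if nxt not in marks:
                marks.add(nxt)
                queue.append((nxt, depth + 1))


def clear_mark(vertex):
    """Query B: clear the mark on a single vertex."""
    marks.discard(vertex)


bounded_mark("a", max_depth=2)  # query A marks a, b and c
after_a = set(marks)
clear_mark("b")                 # query B then unmarks the bridge
after_b = set(marks)
```

After B, the marked set is `{"a", "c"}`: each query was valid on its own, but the marks no longer describe a connected region reachable from `a`, and nothing in the store knows that was the point of query A.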
Of course we could say that such query A should be a part of a transaction, and isolation can be more strictly enforced -- but again, to what degree? Poor locality will cause the transactions to be interconnected with each other. How does a graph database determine what is the true purpose of the algorithm under each query? What does it guarantee?
Many graph databases now claim ACID, but what do they really mean?
Is it just a fancy query language over a traditional data model? Say, you could also build graph queries for a SQL database -- what does a graph database provide that such graph-over-SQL cannot?
p.s. I work on Microsoft Graph Engine: https://github.com/Microsoft/GraphEngine. We decided to build a modular graph processor rather than calling it a graph database, because we don't really know, by default, what kind of semantics a user wants. With GraphEngine, you could plug in linear query languages like Gremlin or GraphQL, you can also plug in SPARQL, or a traditional relational model with strong guarantees, or go down to a bare-metal key-value store with atomicity and durability only. I do think that a graph data model is very helpful in many scenarios, but I think we really need to advance the research on the semantics of graph management.
"Claiming ACID" what is ambiguous about that? Transaction support with different serialization levels, like other databases that offer it.
And Neo4j originally started b/c RDBMS was not able to execute the complex deep traversals needed in real time. Dedicated storage & query engine for graphs allow you to run statements quickly that would otherwise take too long to execute.
Regarding the data model, the property-graph model is much closer to the object model but with richer relationships, it doesn't suffer from the object-rdbms impedance mismatch and is better suited to express real-world domains & scenarios. It also represents semantic relevant relationships as first class citizens in the database, allowing for proper information representation and much faster retrieval.
Disclaimer: I work with/for Neo4j, for 8+ years and still love it.
A non-graph-database would not provide operators like deep traversals. Operations are tightly bound to ACID as a whole, not just isolation. Of course ACID would always hold if you strictly linearize everything, but that defeats the purpose of data management, and one would achieve the same goal with even macro processors like `m4`.
Getting traversals and other graph algorithms into the business means that there are a lot of things that should be reconsidered, like constraints and triggers.
For example, if you cannot write a constraint to limit the local clustering coefficient of every entity, you do not proceed in your traversal with a good upper bound time budget. However, it is the vertices that you _don't_ visit that will propagate these constraints back while you are halfway there. Parallelizing such queries, in my opinion, is beyond state-of-the-art research.
You can do this with a recursive common table expression.
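For reference, a bounded traversal with a recursive CTE looks roughly like this. It's a sketch using SQLite through Python's `sqlite3`; the `edge` table, its contents and the depth limit of 10 are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?)", [
    ("a", "b"), ("b", "c"), ("c", "d"), ("x", "y"),
])

# Everything reachable from a start node within a maximum number of hops.
# The depth bound also guarantees termination if the graph has cycles.
reachable = conn.execute("""
    WITH RECURSIVE reach(node, depth) AS (
        SELECT ?, 0
        UNION
        SELECT edge.dst, reach.depth + 1
        FROM reach JOIN edge ON edge.src = reach.node
        WHERE reach.depth < ?
    )
    SELECT node, MIN(depth) FROM reach GROUP BY node ORDER BY MIN(depth)
""", ("a", 10)).fetchall()
```

Here `reachable` is `[("a", 0), ("b", 1), ("c", 2), ("d", 3)]`; the disconnected `x -> y` edge is never visited. Whether this performs acceptably on deep paths over large data is a separate question.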
Look, between the database formalisms, they're all "complete" in the sense that you can choose any database and solve all the problems. But certain databases are going to be pathologically bad at solving certain types of problems, which is why there are so many sub-niches that persist over time.
For deep path traversals, you can do it with RDBMS, but a graph DB is going to win every time in part because the data structure is just set up for that purpose. There are other queries where RDBMS will be best too. So it goes.
I think that in a graph database, the ACID property comes at a greater cost, but to me it's a tool I use as a secondary store, derived from the "truth" in the RDBMS. I use it to store the data more efficiently for queries and to query data in a discovering fashion that would be significantly more complicated in the underlying RDBMS.
I liked Neo4j quite a bit, it could handle all the sensor/IoT data we could throw at it. Back then it had (and I'm sure still does) a beautiful interactive data visualization dashboard, great Cypher tutorials, and more.
Neo4j is a good database. I went to write a database driver for it, and found it extraordinarily difficult. I knew it would take at least a month of work to build.
At the same time really cool tools like Firebase were becoming popular, and Multi-Master database architecture with Cassandra and Riak were showcasing what high availability could do.
So I decided, rather than implementing the Neo4j driver, which I knew was bound to Neo4j's Master-Slave architecture, I would rather switch to Firebase or build my own mashup of all the tools I wanted:
- Firebase (realtime)
- Neo4j (graphs)
- Cassandra (multi-master / P2P)
- CouchDB (offline-first)
I spent a few weeks building a prototype and submitted it to HackerNews in early 2014. It was a huge success.
Since then, we've gotten 7.5K+ stars (https://github.com/amark/gun), raised venture capital money, and introduced decentralized cryptographically secure user blockchains, and a ton more.
Graph databases, to me, are so compelling, I have not only "used them again" but spent the last 3.5+ years of my life dedicated to building, improving, and making them more awesome.
I certainly hope others try them, even if it isn't GUN. They're worth a shot, but aren't a silver bullet, so use them where it makes sense.
Hey, I noticed your nice resources in your profile - particularly Haidt. For my wife's PhD, she worked with Baumeister, a colleague of Haidt. Would love to hear more about your interest in civil discourse and other such things! Shoot me an email?
Our data set could have been handled fine with a relational database, honestly. However this was a rare case where over-engineering a problem and using the latest technology saved time.
- In 2011, it worked great on small data that fit in RAM, but once the data became bigger than RAM, queries would take unexpectedly large numbers of seconds. How much data do you put into Neo4j?
- I admired the friendly little web server until I realized that it was a massive security hole: anyone who could access it could run arbitrary code on the server, it ran over plain HTTP, and if you put it behind an HTTPS proxy, it stopped working. I hope this isn't still the case. Does it have reasonable access control and HTTPS now? Could you use the Web interface in production?
On (2), yes, the little UI now requires a username/password, and it supports HTTPS. HTTP remains available, defaulting to localhost access.
Every database benefits from having the _hot_ dataset in memory, so that's the same with Neo4j.
2011 was many years ago, since then the memory management has been completely rewritten.
You very probably wouldn't use the Neo4j Browser in production as it is meant to be a developer tool. Usually, you would build an app that uses the drivers to connect to the db.
SWE = Software engineer
DBMS = Database management system
DAU = Daily active users
It stores information as triples (Bob -> Married to -> Gary) and with properties (Bob.last_name = Stamper)
I've been finding that the benefits keep on paying off. I can arbitrarily relate any thing to any other thing (and query those relationships) without changing code or database schema at all.
And the fact that it's a literal, intuitive, representation of reality makes things much easier to reason about.
When viewing something and seeing all the related info, the data nerd in me loves it: https://unlikekinds.com/t/unlike-kinds (meta)
The advantages are that it's just sql, has good performance, and we can query the graph using relational logic rather than n+1 traversal. The trade off is space (the closure table has the potential to be huge).
So it depends on the size of the data set. Part of me wishes we'd built something that's easier to partition, but for now that's a future concern.
Whether this is acceptable depends not only on the size of the data set but also on how often it changes compared to how often you query it.
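To show where both the space and the write cost of a closure table come from, here is a small sketch with SQLite via Python's `sqlite3`. The table layout and the `add_node`/`add_edge` helpers are assumptions, and the insert logic only handles tree-shaped data (one path between any two nodes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edge (parent TEXT, child TEXT);
CREATE TABLE closure (ancestor TEXT, descendant TEXT, depth INTEGER);
""")


def add_node(node):
    # Every node reaches itself at depth 0.
    conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node, node))


def add_edge(parent, child):
    """Insert an edge plus every ancestor/descendant pair it implies."""
    conn.execute("INSERT INTO edge VALUES (?, ?)", (parent, child))
    conn.execute("""
        INSERT INTO closure (ancestor, descendant, depth)
        SELECT a.ancestor, d.descendant, a.depth + d.depth + 1
        FROM closure a, closure d
        WHERE a.descendant = ? AND d.ancestor = ?
    """, (parent, child))


for n in "abcd":
    add_node(n)
add_edge("a", "b")
add_edge("b", "c")
add_edge("c", "d")

# Reading all descendants is a single indexed lookup, no traversal.
descendants_of_a = conn.execute(
    "SELECT descendant, depth FROM closure "
    "WHERE ancestor = 'a' AND depth > 0 ORDER BY depth"
).fetchall()
closure_rows = conn.execute("SELECT COUNT(*) FROM closure").fetchone()[0]
```

Three edges over four nodes already produce 10 closure rows, and each new edge writes one row per (ancestor of parent, descendant of child) pair, so reads are cheap while frequent changes are not.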
If we go beyond that, assuming there was one which provided great performance, data integrity and can be reliable as a primary database — then Graph DBs are just better.
First, the schema and data modeling is incredibly simple. Our minds think in graph terms. Things connecting to each other is very natural to us as human beings. Graph DBs replicate that in a very straightforward way.
Then, many graph DBs, being modern, support flexible schemas, something which is a huge win for speed of application iteration.
Graph DBs are also sparse, which means it's a lot easier to model many different kinds of data sources and data types into the same "table." What that gives is the ability to query across anything in the entire DB, without being concerned about table-level boundaries.
We were solving this problem with Google's knowledge graph, where we had to fit a movie dataset in the DB. The film industry has so many roles (director, producer, actor, cinematographer, and so on) that having a table for each, with the same person often doing multiple roles, is just super fucking hard. With hundreds of such roles, each role being a table would be insane. Representing this information in graphs is a cakewalk in comparison. And this problem gets a lot worse if you then switch to the music industry, books and others (hence, the decision to be a knowledge "graph").
Functionality wise, graph DBs provide a super set of SQL. They support all the (equivalent of) select x from y where z type statements, while also doing fast and recursive traversals and joins at the DB level.
And recursive traversals and joins are a huge deal. The rise of GraphQL over REST APIs is in a way indicative of that. To render a page in modern websites, you need to recursively ask for components (think questions in Quora or Stack Overflow). I remember Quora would have thousands of such components on a single page. GraphQL made it easier to query for those, by expressing a way to retrieve this tree in a single query. But, the internal mechanics of doing this via relational tables is still the same, which is to repeat a query and collect cycle. Graph DBs natively support things like these, and imagine how much more efficient and powerful that is.
Once you start to wrap your head around graphs, it’s hard to not be wholeheartedly impressed by their power.
Disclaimer: I'm author of dgraph.io. But, don't let genetic fallacy blind you. My points above stem from the reasons which propelled me to jump into the graph DB world.
Do you still feel there is an advantage of graph over relational when we have a known schema and known relationships without deep recursive relationships? For example, an inventory tracking system, where we have items, customers, deliveries, etc.?
I like the idea of being able to throw some metadata onto any of those tables quickly during prototyping, but my gut feeling is that long term we run into the need to be more structured and explicit like we do with a relational DB. It reminds me somewhat of the tradeoffs with NoSQL DBs during development.
https://github.com/dgraph-io/graphoverflow (unmaintained, so please don't complain if it doesn't work :-)).
If you build systems like inventory tracking, question answering, etc., the hard logic of relevant data retrieval can either lie in your application or within your DB. Former is the case when you use relational DBs, latter is the case when you use graph DBs.
With a graph DB, you can put the data together quickly, but then have the DB do the heavy lifting of "given a customer, find me all the items and the locations of delivery" (just a random question I spent 2 seconds on, not representative of a real workload); or "given a question, find me all the answers, sorted by a score; top 5 comments on these answers sorted by date, with a count of total comments, count of likes, count of dislikes, etc." (a real workload for QA sites). Then the application iteration becomes largely a factor of query iteration, not backend logic iteration.
^ And that's solid! That kind of stuff is what makes developers love JS over C++ (random comparison).
Have you had success with modelling temporal data in a graph? E.g. "Bob worked for WB from 1999-2006, Disney 2006-2008, then WB again. In 2013 she transitioned and is now called Anna". Thinking of a property graph engine, both relations and properties would need to be versioned.
It feels like a graph database should be a good fit, better than a RDBMS for sure, but I've pages of sketches on how to model history and come up with nothing workable. https://arxiv.org/abs/1604.08568 is the best paper I've found; but I haven't got anything working.
My interest is as an amateur archivist, as dumping a description like the above into a text field and displaying it back to people is less useful than being able to query it or show the changes over time. Especially when you want to link it with files or media for retrieval purposes.
Bob -worked_at-> Work node
Work node -from-> date
Work node -to-> date
Work node -employer-> WB, Disney, etc.
Then each instance of Bob working would be a node in the graph. Note that (and this might be counterintuitive) this is the same as how you'd represent marriage data as well.
These intermediate nodes are the only complexity that one has to think about in a graph model (even then it's not that complex compared to thinking through how 20 different tables are connected). Rest is easy peasy.
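Here's a rough sketch of that intermediate-node layout as plain Python triples, with a point-in-time query on top. It isn't any particular graph DB's API; the node ids (`work1`...), the helper names and the exact dates are invented (the example above only gives years), and an open-ended job is modelled by simply omitting the `to` edge:

```python
from datetime import date

# Triples: (subject, predicate, object). "work1" etc. are the intermediate nodes.
triples = [
    ("Bob", "worked_at", "work1"),
    ("work1", "employer", "WB"),
    ("work1", "from", date(1999, 1, 1)),
    ("work1", "to", date(2006, 1, 1)),
    ("Bob", "worked_at", "work2"),
    ("work2", "employer", "Disney"),
    ("work2", "from", date(2006, 1, 1)),
    ("work2", "to", date(2008, 1, 1)),
    ("Bob", "worked_at", "work3"),
    ("work3", "employer", "WB"),
    ("work3", "from", date(2008, 1, 1)),  # no "to" edge: still ongoing
]


def objects(subject, predicate):
    """All objects reachable from subject over the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def employers_on(person, day):
    """Employers whose [from, to) interval contains the given day."""
    result = []
    for work in objects(person, "worked_at"):
        start = objects(work, "from")[0]
        end = (objects(work, "to") or [date.max])[0]
        if start <= day < end:
            result.append(objects(work, "employer")[0])
    return result
```

`employers_on("Bob", date(2007, 6, 1))` gives `["Disney"]` and `employers_on("Bob", date(2010, 1, 1))` gives `["WB"]`. Versioning a property such as a name change would follow the same pattern, with one intermediate node per period.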
If the situation calls for it, sure! The current use case is sort of up in the air. The decision was made to use a graph database to store the mutation of records over time, but then the higher ups want to limit what's put in it, so... I'm not sure if the computation costs are worth what it's actually capable of. From what I'm gathering, if one is looking to store complex data, which is highly connected through edge-case relationships, as in, greater than 5 types of edges, then it might be worth looking into, but I can't imagine how a dynamic/traditional table database wouldn't have been faster in terms of what we need it for, querying large lists of data, with 3-4 edge traversals.
I only have a year~+ of experience with databases, so grain of salt. In terms of personal preference, working with a graph database has been quite fun.
I really like Gremlin and I like how you can extend the relations and easily do new computations you never thought of, but it's not the savior it's been hailed as, in my opinion. For a lot of problems SQL will do you well. Migrating can be a bitch with SQL, but if it's a domain where the basic functionality is solved (such as a web shop) I wouldn't bother with a graph database until I find a good use case for it. You can always migrate your SQL tables to a graph DB later on if you think it's worth it.
Honestly, a graph database made discussions with the domain experts MUCH easier. And the schemalessness made evolutions much easier. Our technical team embraced the concept really quickly. And the domain experts have a clean mental model of the data that were otherwise split between very unfriendly technologies (XML databases, files, CSV).
We wonder whether Elasticsearch could be removed from our architecture (because managing 2 databases is a mess). But we do not know yet if Neo4J can handle both the load and the variety of search use-cases.
As for me, for decades I've wanted to be able to have everything stored on my computer represented as a graph. (Times have changed, so there's obviously a strong network-connected aspect now.)
The variety of DBs available if you use RDF is great as well. Different DBs have different strong sides but we can keep the same data model and query language.
It's also based on a small HTTP server: http://github.com/tinspin/rupy
I will use it for the rest of my life in every project that needs relations.
(I'm using pyDatalog, which is open source and works with a variety of database backends.)
But it seems likely that there are cases where graph databases can excel, namely the queries that differentiate them, such as finding long, variable-length paths.
I work with large amounts of geographic data. We use Cassandra and RDBMS as the traditional storage but whenever we want to do network analysis it goes into graph DB just to take advantage of the tooling.
And one of our use cases is exactly what you mention. If you are interested in the properties of the edges of long highways in a road network that can stretch hundreds of edges, for instance, RDBMS ain’t gonna cut it.
It was a wild ride. At the time I started there was little to no tooling, only a few SPARQL implementations, and SPARQL 1.1 was not released yet. It was a PITA to use but it still stuck with me: I finally had an agile data model that allowed me and our customers to grow with the problem. I was quite sceptical whether that would ever scale but I still didn't stop using it.
Initially one can be overwhelmed by RDF: It is a very simple data model but at the same time it's a technology stack that allows you to do a lot of crazy stuff. You can describe semantics of the data in vocabularies and ontologies, which you should share and re-use, you can traverse the graph with its query language SPARQL and you have additional layers like reasoning that can figure out hidden gems in your data and make life easier when you consume or validate it. And most recently people started integrating machine learning toolkits into the stack so you can directly train models based on your RDF knowledge graph.
If you want to solve a small problem RDF might not be the most logical choice at first. But then you start thinking about it again and you figure out that this is probably not the end of it. Sure, maybe you would be faster by using the latest and greatest key/value DB and hack some stuff in fancy web frameworks. But then again there is a fair chance the customer wants you to add stuff in the future and you are quite certain that at one point it will blow up because the technology could not handle it anymore.
That will not happen with RDF. You will have to invest more time at first, you will talk about things like the semantics of your customer's data, and you will spend quite some time figuring out how to create identifiers (URIs in RDF) that are still valid years from now. You will have a look at existing vocabularies and just refine things that are really necessary for the particular use case. You will think about integrating data from relational systems, Excel files or JSON APIs by mapping them to RDF, which again is all defined in W3C standards. You will mock up some data in a text editor, written in your favourite serialization of RDF. Yes, there are many serializations available, and you should most definitely throw away any book/text that starts with RDF/XML; use Turtle or JSON-LD instead, whatever fits you best.
After that you start automating everything, you write some glue-code that interprets the DSL you just built on top of RDF and appropriate vocabularies and you start to adjust everything to your customer's needs. Once you go live it will look and feel like any other solution you built before but unlike those, you can extend it easily and increase its complexity once you need it.
And at that point you realize that this is all worth it and you will most likely not touch any other technology stack anymore. At least that's what I did.
I could go on for a long time. In fact, I teach this stack in companies and gov-organizations over several days and I can only scratch the surface of what you can do with it. It does scale, I'm convinced of that by now, and the tooling is getting better and better.
If you are interested start having a look at the Creative Commons course/slides we started building. There is still lots of content that should be added but I had to start somewhere: http://linked-data-training.zazuko.com/
Also have a look at Wikipedia for a list of SPARQL implementations: https://en.wikipedia.org/wiki/Comparison_of_triplestores
Would I use other graph databases? Definitely not. The great thing about RDF is that it's open, you can cross-reference data across silos/domains and profit from work others did. If I create another silo in a proprietary graph model, why would I bother?
Let me finish with a quote from Dan Brickley (Google's schema.org) and Libby Miller (BBC) in a recent book about RDF validation:
> People think RDF is a pain because it is complicated. The truth is even worse. RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated. While you can avoid RDF, it is harder to avoid complicated data and complicated computer problems.
I could not have come up with a better conclusion.