Ask HN: If you've used a graph database, would you use it again?
189 points by networked 3 months ago | 82 comments

I used Neo4j for a few side projects but my go-to is still PostgreSQL. The largest flaw I see with Neo4j (and probably other graph databases as well) is that it forces you to think of your entities as either vertices or edges, and that line tends to be less clear than you might expect.

For example: (:Person)-[:BEFRIENDS]->(:Person)

If we want to store a date with that relationship, Neo4j has your back; that's entirely possible (relationships can have attributes). But now our requirements change and we also want an entity for Events shared by friends (e.g. a friendship anniversary), so we have to remodel our data to something like:

  (:Person)-[:PARTICIPATES_IN]->(:Friendship)<-[:PARTICIPATES_IN]-(:Person)
  (:Friendship)-[:HAS]->(:Event)
In SQL that wouldn't have been a remodel, because there's no difference between a relationship and an entity. We would've gone from:

  Person(id)
  Friendship(person1_id, person2_id)

to:

  Person(id)
  Friendship(person1_id, person2_id)
  Event(friendship_id)
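The schema evolution described above can be sketched in runnable form (SQLite standing in for an RDBMS here; the `since` and `name` columns and all the data are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY);
    CREATE TABLE Friendship (
        id         INTEGER PRIMARY KEY,
        person1_id INTEGER REFERENCES Person(id),
        person2_id INTEGER REFERENCES Person(id),
        since      TEXT  -- the date attribute on the relationship
    );
""")
con.executemany("INSERT INTO Person (id) VALUES (?)", [(1,), (2,)])
con.execute("INSERT INTO Friendship VALUES (10, 1, 2, '2020-01-01')")

# Requirements change: events hang off a friendship. Adding a table that
# references the existing one is additive, not a remodel.
con.execute("""
    CREATE TABLE Event (
        id            INTEGER PRIMARY KEY,
        friendship_id INTEGER REFERENCES Friendship(id),
        name          TEXT
    )
""")
con.execute("INSERT INTO Event VALUES (100, 10, 'friendship anniversary')")

row = con.execute("""
    SELECT e.name, f.since
    FROM Event e JOIN Friendship f ON e.friendship_id = f.id
""").fetchone()
print(row)  # ('friendship anniversary', '2020-01-01')
```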

So I feel like the vertex/edge distinction Neo4j makes gets in the way when data-model needs change, and I ultimately think that modeling your data as a graph is not that helpful. It can, though, be extremely helpful in querying, and that's where its biggest strength lies.

The problem you're describing is mostly attributable to property graph stores, and doesn't apply to named graph engines.

This is the essential difference between neo4j (a property graph engine) and most RDF stores which support quads (i.e. <subject> <predicate> <object> <context>)

Aren’t RDF stores synonymous with triple stores?

RDF is the data model for triples (with serialization formats like Turtle and N-Triples). A triple store is the database that stores them.

I see what you mean. And yet strangely, the need for that "Friendship" node can also be seen as a strength. How else would you assert metadata about that thing?

If A and B share an event via a friendship, you can keep track of things about that. Granted in an RDBMS if all you wanted was to draw the line then you could do it with an extra FK, but I think the conclusion you're drawing is going too far, specifically:

> In SQL that wouldn't have been a remodel, because there's no difference between a relationship and an entity

There is a difference in SQL; relationships are EITHER extra columns OR a join to another table; both are possible. In graphs, "hyper relationships" (e.g. relating more than 2 things) require another node, but this is an apples/oranges comparison.

I probably should've clarified that I was talking about n:m-relationships. And for that case I don't see how it would've been an apples/oranges comparison.

Maybe you want a hypergraph?


I wish this were a networked database that I could just slap on a Digital Ocean droplet and query from anywhere.

Interesting, never heard of the concept. Thank you, it seems to indeed solve that exact problem.

I hope something like this becomes as mature and usable as Neo4j.

You need a hypergraph database!

Seems like a hybrid datastore is ideal then.

I've been using RDF and triplestores / RDF databases for the last half-decade, developing both front-end and back-end systems, and training many developers to work in RDF. If you're used to either relational databases or object-oriented design, it's a really different way of thinking about data. Just like OOP is really good for certain kinds of problems and models, and RDBMS is good for other kinds of problems and models, RDF is great for specific kinds of problems. For example, if you need to combine data from several somewhat incongruent sources into a single coherent database (e.g. dozens of data feeds that are almost the same, but you need to preserve the differences while combining the parts that are the same), you might end up with headaches trying to come up with a good RDBMS design; RDF is really well-suited for that kind of problem.

While a lot of the work I do is covered by NDA, one problem I've applied it to that I can talk about is analyzing basketball play-by-plays. I've spent some time talking to the analytics team at an NBA franchise, and it turns out doing interesting analytics on play-by-plays can be a surprisingly tough nut to crack. RDF was a great tool for tackling this. Here's the source (written in Scala), for anyone interested in having a look: https://github.com/andrewstellman/pbprdf

This is interesting, I always wanted to find a good use case for RDF but in the end RDBMS worked out fine.

Sports events seem a good example. What made it easier for you in your example of basketball play-by-plays with RDF?

Taking the first example https://github.com/andrewstellman/pbprdf#example-analyze-a-s... and translating it into an RDBMS approach seems rather straightforward:

  GameEvent (PersonA, EventType, PersonB, Game, Time)
  Roster (Person, Game, Team)
To get the fouls drawn you then take

  SELECT Team, COUNT(Team)
  FROM GameEvent
  JOIN Roster ON GameEvent.Game = Roster.Game AND GameEvent.PersonB = Roster.Person
  WHERE GameEvent.EventType = 'foul'
  GROUP BY Roster.Team
Is it because of easier schema changes later on like introducing "secondsLeftInPeriod"? I suppose in a normalized relational scenario that could be something like:

  Period (Number, Game, StartTime, EndTime)
And doing

  .. WHERE Period.EndTime - GameEvent.Time < 5 ..
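The fouls-drawn query above can be run end to end against toy data (SQLite via Python; the tables mirror the sketch, all names and numbers are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE GameEvent (PersonA TEXT, EventType TEXT, PersonB TEXT,
                            Game INTEGER, Time INTEGER);
    CREATE TABLE Roster    (Person TEXT, Game INTEGER, Team TEXT);
""")
con.executemany("INSERT INTO Roster VALUES (?, ?, ?)", [
    ("Ann", 1, "Sharks"), ("Bea", 1, "Sharks"), ("Cal", 1, "Jets"),
])
con.executemany("INSERT INTO GameEvent VALUES (?, ?, ?, ?, ?)", [
    ("Cal", "foul", "Ann", 1, 120),   # Ann draws a foul
    ("Cal", "foul", "Bea", 1, 300),   # Bea draws a foul
    ("Ann", "shot", "Cal", 1, 400),   # not a foul, so it is filtered out
])

# Fouls drawn, grouped by the team of the player who drew them (PersonB)
rows = con.execute("""
    SELECT Roster.Team, COUNT(*)
    FROM GameEvent
    JOIN Roster ON GameEvent.Game = Roster.Game
               AND GameEvent.PersonB = Roster.Person
    WHERE GameEvent.EventType = 'foul'
    GROUP BY Roster.Team
""").fetchall()
print(rows)  # [('Sharks', 2)]
```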

> Is it because of easier schema changes later on like introducing "secondsLeftInPeriod"?

That's definitely an important benefit. RDF makes it really easy to introduce changes that not only don't break the existing schema, but can be entirely isolated or combined in queries.

Another thing that RDF makes easy is analysis that takes advantage of a graph -- using relationships with other players, shots, etc. What players have the highest percentage making 3-point shots in possessions immediately after a player on the other team missed a 3-point shot? Building queries where you compare previous possessions, shots, quarters, or players' relations to each other (e.g. the performance of players who were subbed in after previous teammates went scoreless for 3 possessions) -- these things are a lot easier to do in RDF than in SQL.

Obviously, there are many things that are easier to model in RDBMS and query in SQL than with RDF/SPARQL. Every tool has its uses.

Do you know if there's off-the-shelf software (GUI) to create/edit/explore your own RDF dataset? Or does it always involve building your own front-end?

I like WebVOWL for visualizing the RDF ontology. Here's the pbprdf ontology displayed in it: http://www.visualdataweb.de/webvowl/#iri=https://raw.githubu...

A few years ago I put together a quick GUI in C# to make it easier to run SPARQL queries: https://github.com/andrewstellman/sparql-explorer

I haven't found an RDF editor or visual tool that I like. Some people like Topbraid Composer: https://www.topquadrant.com/tools/modeling-topbraid-composer... (commercial, closed source)

Maybe you can have a look at Datao: http://datao.net.

PS: I am the maintainer, and we currently have some certificate issues. Drop me a mail to datao@datao.net if you want to be notified when the problem is solved.

Take a look at http://ontodia.org

Is there a particular RDF store which you would recommend? It seems scaling is a bit of an issue with Apache Jena. It's very easy to bring the entire system to a crawl with certain SPARQL queries and enough data in the store.

Semi-related, is there a good in-browser RDF store? Say something like what PouchDB does for CouchDB (and similar) JSON document stores.

For how important RDF has been to the web, it feels increasingly less "web native" today, as most efforts still seem to be highly Java-focused and browsers mostly don't run Java anymore.

(I ask because a silly project idea I have some tiny amount of notes for is something of a Twine competitor. I realized that while the language ideas I'm exploring don't look like SPARQL or other graph languages, there is a bit of an overlap conceptually under the hood and an RDF store might make sense as a bootstrap tech, but brief searches didn't turn up anything useful.)

At the moment, I use N3.js for its N3.parse() function. It builds a graph of Javascript objects in memory from an RDF string, just like JSON.parse() builds a tree of Javascript objects in memory from a JSON string.

Note: I have never understood the need for a DB on the client-side.

In the case of a Twine-like system to make offline capable games: everything is just HTML/CSS/JS, it often gets packed into a single HTML file, and there is no server side at all. A client-side DB would be important in that case so that you can query/update game state.

In another thread GUN is mentioned, and that may be the closest client-side DB to what I think I'm looking for, if I ever get around to that side project.

This is probably the closest: http://linkeddatafragments.org/

A team I've been working with for years has had a lot of success with Blazegraph: https://www.blazegraph.com/ -- the marketing materials call it "ultra-scalable," and that actually turns out to be true. I haven't seen too many cases where specific queries will cause serious performance problems with the system.

That said, we've done some work to prevent runaway queries (e.g. strict query timeouts, downstream systems that handle that situation gracefully).

I believe Blazegraph is the basis for Amazon Neptune, the graph DB AWS announced at re:Invent last year. https://aws.amazon.com/neptune/

That's great! I've had a chance to talk to some of the engineering and management folks at Blazegraph over the years, and they're a really solid group. It's really nice to hear about a deserving team finding success.

We had a production Rails app running with postgres, and we decided to implement some of our models with Neo4j. Graphs felt like the right way to represent the data, and all of the models were new, so we felt more free to choose the approach that seemed best.

A month later we rewrote everything in SQL - the main drivers were:

- as we refined our model, we realized that a relational DB with a bunch of join tables was good enough

- our developers were more comfortable working with SQL

- it wasn't possible to run complicated queries involving both databases simultaneously

- the Rails ORM felt easier to use than the Neo4j Ruby APIs (though this was certainly a function of our own familiarity with Rails and relational databases in general)

- having the extra database complicated our codebase and complicated our deployment

There was nothing horrifying or surprising in our encounter with graph databases. It just felt like we made the wrong initial architectural decision. We were still trying to define the problem and were trying to use something we didn't fully understand.

I'd hesitate to use graph dbs in the future unless I needed a high-performance app with a lot of data that only a graph could model well. Otherwise having two different types of databases is annoying.

My instinct is that using two different databases in one app adds a lot of complexity, but it's not clear to me that if you need to pick one database, SQL is a better choice than a graph database. It really depends on the application. For many purposes, a graph database can do what a SQL database can do, because a SQL table is very much like a collection of graph nodes. Of course, if you're starting with an app on a SQL database and just adding to it, that's very different from a "green field" project where you can pick technologies freely.

I started using Neo4j 8 years ago after a long time as a relational database developer. I needed it for a project building a LinkedIn clone with skills (at the time LinkedIn didn't have skills). I was going to need a massive join table of user-skill-user and decided it was best in a graph. I built a Ruby gem, "neography", as a Neo4j driver and became an open source contributor. Later Neo4j contracted me to build a rules engine in a week for one of their clients. That got me a job as a Sales Engineer at Neo4j. 100+ blog posts and 200+ GitHub repos later, with lots of travel and many wins, I still love the job, and still love the database.

I have been looking at learning graph database architecture recently and your blog posts keep coming. Great stuff! Keep it coming

How is a graph database built under the hood?

I know that a decent RDBMS (simplified) will consist of the following:

- data in blocks organised with a block-size that the underlying filesystem likes

- a cache for the most frequently used blocks

- every index is a B-Tree with pointers to the blocks containing the tuples

Then there are column stores as well as row stores, and for compression you might have some dictionary encoding going on.

Now, how does the Graph Database look under the hood and what are the complexities involved? How is the Graph persisted?

Graph databases are built a lot of different ways; for example, Neo4j's architecture is very, very different from something like an RDF triple store, or DataStax on top of Cassandra.

Neo4j internals can be seen here: https://www.slideshare.net/thobe/an-overview-of-neo4j-intern... It's a bit old but I think mostly still accurate.

In graphs you have to persist nodes and edges, though you may partition nodes by label/category. In the case of neo4j there is a property store rather than a set of columns.

Thanks, very helpful. I am just looking at it and will have a bit of a think about this later :)

A graph database is similar, only it uses direct-record-ids for linking connected entities and not indexes. So instead of doing joins on indexes it follows record-pointers during graph traversals.
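This pointer-following approach (often called "index-free adjacency") can be illustrated with a toy in-memory sketch; the class and function names here are made up for the example:

```python
# Each node record holds direct references to its neighbours, so a traversal
# follows pointers instead of doing an index lookup per hop.

class Node:
    def __init__(self, name):
        self.name = name
        self.out = []          # direct record pointers to neighbours

    def link(self, other):
        self.out.append(other)

def reachable(start, depth):
    """Collect node names reachable within `depth` hops by pointer-chasing."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [n for node in frontier for n in node.out if n not in seen]
        seen.update(frontier)
    return sorted(n.name for n in seen)

a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
a.link(b); b.link(c); c.link(d)
print(reachable(a, 2))  # ['a', 'b', 'c']
```

A relational database would instead look up each hop in an index on the join table, paying an index traversal per hop.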

In the graph databases book (graphdatabases.com) there is a chapter on the internal architecture of Neo4j.

Interesting question. I think it is critical to point out that the underlying principles of a graph database are different from an RDBMS, because the operators in a graph database may not comply with relational algebra.

Consider the following case:

Client A issues a query: starting from a vertex, conduct a bounded closure search, giving every visited vertex a mark (coloring, or a lexical flag, whatever you would expect from a graph algorithm).

Client B issues a query: clear any marks applied to a particular vertex, which happens to be one of the vertices visited by Client A's query.

Now, race conditions aside, let's assume we first process query A, then B. Would we allow query B to succeed? It is clearly possible for query B to break the semantics of query A: for example, query A goes through a bridge and then query B cuts the bridge, so the connectivity information is lost.

Of course we could say that such a query A should be part of a transaction, and isolation can be more strictly enforced -- but again, to what degree? Poor locality will cause the transactions to be interconnected with each other. How does a graph database determine the true purpose of the algorithm under each query? What does it guarantee?

Many graph databases now claim ACID, but what do they really mean?

Is it just a fancy query language over a traditional data model? Say, you could also build graph queries for a SQL database -- what does a graph database provide that such graph-over-SQL cannot?

p.s. I work on Microsoft Graph Engine: https://github.com/Microsoft/GraphEngine. We decided to build a modular graph processor rather than calling it a graph database, because we don't really know, by default, what kind of semantics a user wants. With GraphEngine, you can plug in linear query languages like Gremlin or GraphQL, you can also plug in SPARQL, or a traditional relational model with strong guarantees, or go down to a bare-metal key-value store with atomicity and durability only. I do think that a graph data model is very helpful in many scenarios, but I think we really need to advance the research on the semantics of graph management.

Transaction isolation is a no-brainer, so I don't think your example holds. Also, your example is related not to the algebra but to isolation.

"Claiming ACID" what is ambiguous about that? Transaction support with different serialization levels, like other databases that offer it.

And Neo4j originally started because RDBMSs were not able to execute the complex deep traversals needed in real time. A dedicated storage & query engine for graphs allows you to run statements quickly that would otherwise take too long to execute.

Regarding the data model, the property-graph model is much closer to the object model but with richer relationships; it doesn't suffer from the object-RDBMS impedance mismatch and is better suited to expressing real-world domains & scenarios. It also represents semantically relevant relationships as first-class citizens in the database, allowing for proper information representation and much faster retrieval.

Disclaimer: I work with/for Neo4j, for 8+ years and still love it.

> "Claiming ACID" what is ambiguous about that? Transaction support with different serialization levels, like other databases that offer it.

A non-graph database would not provide operators like deep traversals. Operations are tightly bound to ACID as a whole, not just isolation. Of course ACID always holds if you strictly linearize everything, but that defeats the purpose of data management, and one could achieve the same goal with even a macro processor like `m4`.

Getting traversals and other graph algorithms into the picture means that there are a lot of things that should be reconsidered, like constraints and triggers.

For example, if you cannot write a constraint to limit the local clustering coefficient of every entity, you cannot proceed in your traversal with a good upper bound on your time budget. However, it is the vertices that you _don't_ visit that will propagate these constraints back while you are halfway there. Parallelizing such queries, in my opinion, is beyond state-of-the-art research.

> A non-graph-database would not provide operators like deep traversals

You can do this with a recursive common table expression.
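A recursive CTE traversal might look like this (SQLite via Python; the edge list is a toy example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edge VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")])

# Everything reachable from 'a', with hop count; UNION (not UNION ALL)
# deduplicates rows, which keeps cyclic graphs from looping forever.
rows = con.execute("""
    WITH RECURSIVE reach(node, depth) AS (
        SELECT 'a', 0
        UNION
        SELECT edge.dst, reach.depth + 1
        FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT node, depth FROM reach ORDER BY depth, node
""").fetchall()
print(rows)  # [('a', 0), ('b', 1), ('c', 2), ('e', 2), ('d', 3)]
```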

While this is technically true, in the SQL world this requires wizard level skills that most SQL developers do not possess, and when you arrive at this spot, you end up with a query that performs really, really badly.

Look, between the database formalisms, they're all "complete" in the sense that you can choose any database and solve all the problems. But certain databases are going to be pathologically bad at solving certain types of problems, which is why there are so many sub-niches that persist over time.

For deep path traversals, you can do it with RDBMS, but a graph DB is going to win every time in part because the data structure is just set up for that purpose. There are other queries where RDBMS will be best too. So it goes.

Recursive CTEs are breadth-first, which may not always be what you need.

ACID is whatever you define. In an RDBMS you can make a query that returns dirty data, or where data is partially saved if an error happens. Or you can roll it into a fully ACID-compliant query/transaction.

I think that in a graph database the ACID properties come at a greater cost, but to me it's a tool I use as a secondary store, derived from the "truth" in the RDBMS. I use it to store the data more efficiently for queries and to query data in an exploratory fashion that would be significantly more complicated in the underlying RDBMS.

This is a primary scenario for us -- a knowledge graph streaming into the system, and graph queries run with private working sets, without touching the core data.

How mature is GE? Who's using it in production?

I wrote a driver for MongoDB in 2010, but then moved onto Neo4j in late 2013.

I liked Neo4j quite a bit; it could handle all the sensor/IoT data we could throw at it. Back then it had (and I'm sure it still has) a beautiful interactive data visualization dashboard, great Cypher tutorials, and more.

Neo4j is a good database. I went to write a database driver for it, and found it extraordinarily difficult. I knew it would take at least a month of work to build.

At the same time really cool tools like Firebase were becoming popular, and Multi-Master database architecture with Cassandra and Riak were showcasing what high availability could do.

So I decided, rather than implementing the Neo4j driver, which I knew was bound to Neo4j's Master-Slave architecture, I would rather switch to Firebase or build my own mashup of all the tools I wanted:

- Firebase (realtime)

- Neo4j (graphs)

- Cassandra (multi-master / P2P)

- CouchDB (offline-first)

I spent a few weeks building a prototype and submitted it to HackerNews in early 2014. It was a huge success.

Since then, we've gotten 7.5K+ stars (https://github.com/amark/gun), raised venture capital money, and introduced decentralized cryptographically secure user blockchains, and a ton more.

Graph databases, to me, are so compelling, I have not only "used them again" but spent the last 3.5+ years of my life dedicated to building, improving, and making them more awesome.

I certainly hope others try them, even if it isn't GUN. They're worth a shot, but aren't a silver bullet, so use them where it makes sense.

Great pitch, I have been on the lookout for a proper offline-first realtime data connection with some kind of conflict resolution. It seems to me like such a natural combination of functions, but I haven't come across any tool that has it all yet. Will definitely check this out!

What’s gun, you mean gnu open source ?

Likely they're referring to https://github.com/amark/gun , which they refer to earlier. amark is Mark Nadal.

Thanks, yes (you and xumingmingv) are correct.

Hey, I noticed your nice resources in your profile - particularly Haidt. For my wife's PhD, she worked with Baumeister, a colleague of Haidt. Would love to hear more about your interest in civil discourse and other such things! Shoot me an email?


their project name is gun: https://github.com/amark/gun

The coolest thing to me about Neo4j is that it spins up a little web server with an extremely friendly UI that allows people to build queries and run them locally. My non-coder coworker wrote all her own queries and found, then fixed, errors in the data entirely on her own.

Our data set could have been handled fine with a relational database, honestly. However this was a rare case where over-engineering a problem and using the latest technology saved time.

So I used Neo4j in 2011. It was very exciting at first, and then I got quite burned by it when I tried to make something real. Many people in this thread are describing a very different experience, and I want to know if it has really dramatically improved, or if the use cases of Neo4j users are just different from mine.

- In 2011, it worked great on small data that fit in RAM, but once the data became bigger than RAM, queries would take unexpectedly large numbers of seconds. How much data do you put into Neo4j?

- I admired the friendly little web server until I realized that it was a massive security hole: anyone who could access it could run arbitrary code on the server, it ran over plain HTTP, and if you put it behind an HTTPS proxy, it stopped working. I hope this isn't still the case. Does it have reasonable access control and HTTPS now? Could you use the Web interface in production?

On (1), the memory layer has been entirely rewritten since 2011. It used to be a combination of MMAP and on-heap caching; mmap in java being notoriously terrible and caching on the heap being even worse. The memory layer now works similar to postgres, with a user-space page cache managing blocks of RAM. So: It's certainly changed, and in my experience much for the better.

On (2), yes, the little UI now requires a username/password, and it supports HTTPS. HTTP remains available, defaulting to localhost access.

It runs with authentication enabled by default and uses HTTPS as well as our binary (always-TLS) protocol.

Every database benefits from having the _hot_ dataset in memory, so that's the same with Neo4j.

2011 was many years ago, since then the memory management has been completely rewritten. You very probably wouldn't use the Neo4j Browser in production as it is meant to be a developer tool. Usually, you would build an app that uses the drivers to connect to the db.

Facebook uses a custom graph database called TAO (nodes, edges, traverse them [1]) for storing (almost) all production data. Based on DBMS classes from the uni days this is counterintuitive, but <scale>. In practice it just worked; it didn't get in the way and enabled SWEs to move fast. Having said that, I don't see why I would use a graph database unless I have >10M DAUs.

[1] https://www.facebook.com/notes/facebook-engineering/tao-the-...

TAO is probably better described as a custom graph cache. It understands the data model and can do certain operations (especially stuff like set intersections), but the authoritative store for the data's still MySQL

Does that mean Daily Active Users? Please spell out abbreviations!

TAO = The Associations and Objects (from the linked paper)

SWE = Software engineer

DBMS = Database management system

DAU = Daily active users

I'm currently using a home-baked graph database built on top of PostgreSQL for https://unlikekinds.com

It stores information as triples (Bob -> Married to -> Gary) and with properties (Bob.last_name = Stamper)

I've been finding that the benefits keep on paying off. I can arbitrarily relate any thing to any other thing (and query those relationships) without changing code or the database schema at all.

And the fact that it's a literal, intuitive representation of reality makes things much easier to reason about.
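The triples-plus-properties idea sketched above can be reproduced in a few lines of SQL (SQLite here rather than PostgreSQL; table and column names are made up for the sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE triple   (subject TEXT, predicate TEXT, object TEXT);
    CREATE TABLE property (subject TEXT, key TEXT, value TEXT);
""")
con.execute("INSERT INTO triple VALUES ('Bob', 'married_to', 'Gary')")
con.execute("INSERT INTO property VALUES ('Bob', 'last_name', 'Stamper')")

# Relating a new kind of thing requires no schema change, just another row:
con.execute("INSERT INTO triple VALUES ('Bob', 'works_at', 'WB')")

rows = con.execute(
    "SELECT predicate, object FROM triple WHERE subject = 'Bob' ORDER BY predicate"
).fetchall()
print(rows)  # [('married_to', 'Gary'), ('works_at', 'WB')]
```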

When viewing something and seeing all the related info, the data nerd in me loves it: https://unlikekinds.com/t/unlike-kinds (meta)

I used the graph part of ArangoDB for a recent project and I appreciated the flexible nature of the relationships between entities (being edges, so always n-n). For example, my customer often changed its mind about some critical parts of the business logic (and thus the relations between entities) and it was a pleasure to update without rewriting too much code. Also, queries involving many relationships seem more powerful and simpler than in an RDBMS. Anyway, maybe not a universal solution but, as a web/mobile developer, I can't see any actual limitations for my daily use case.

Our app models directed graphs in Postgres with a closure table (the transitive edges between nodes).

The advantages are that it's just sql, has good performance, and we can query the graph using relational logic rather than n+1 traversal. The trade off is space (the closure table has the potential to be huge).

So it depends on the size of the data set. Part of me wishes we'd built something that's easier to partition, but for now that's a future concern.
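The closure-table approach can be sketched like this (SQLite via Python; table names and data are illustrative). On every insert we derive the new transitive pairs, so "all descendants of X" becomes a plain lookup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edge    (parent TEXT, child TEXT);
    CREATE TABLE closure (ancestor TEXT, descendant TEXT, depth INTEGER);
""")

def add_edge(con, parent, child):
    con.execute("INSERT INTO edge VALUES (?, ?)", (parent, child))
    # every ancestor of `parent` (plus parent itself) now reaches
    # every descendant of `child` (plus child itself)
    ancestors = con.execute(
        "SELECT ancestor, depth FROM closure WHERE descendant = ?", (parent,)
    ).fetchall() + [(parent, 0)]
    descendants = con.execute(
        "SELECT descendant, depth FROM closure WHERE ancestor = ?", (child,)
    ).fetchall() + [(child, 0)]
    con.executemany(
        "INSERT INTO closure VALUES (?, ?, ?)",
        [(a, d, da + dd + 1) for a, da in ancestors for d, dd in descendants],
    )

for p, c in [("a", "b"), ("b", "c"), ("b", "d")]:
    add_edge(con, p, c)

# No recursion at query time: descendants of 'a' is one indexed scan
rows = con.execute(
    "SELECT descendant, depth FROM closure WHERE ancestor = 'a' ORDER BY depth, descendant"
).fetchall()
print(rows)  # [('b', 1), ('c', 2), ('d', 2)]
```

The space trade-off mentioned above shows up directly: the closure table grows with the number of ancestor/descendant pairs, not the number of edges.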

Surely another disadvantage of storing all that redundant data is that either (a) you lose atomicity of updates or (b) have some terrifyingly large transactions for a relatively simple change to the underlying data.

Whether this is acceptable depends not only on the size of the data set but also on how often it changes compared to how often you query it.

There is a lot of stigma attached to graph DBs. Would it provide good performance? Should I ever use it as my primary database? Is my data ever safe with a graph DB?

If we go beyond that, assuming there was one which provided great performance, data integrity and can be reliable as a primary database — then Graph DBs are just better.

First, the schema and data modeling is incredibly simple. Our minds think in graph terms. Things connecting to each other is very natural to us as human beings. Graph DBs replicate that in a very straightforward way.

Then, many graph DBs, being modern, support flexible schemas, which is a huge win for the speed of application iteration.

Graph DBs are also sparse, which means it's a lot easier to model many different kinds of data sources and data types into the same "table." What that gives you is the ability to query across anything in the entire DB, without being concerned about table-level boundaries.

We were solving this problem with Google's knowledge graph where we had to fit movie dataset in DB. The film industry has so many roles (director, producer, actor, cinematographer, and so on), that having a table for each, with many times same person doing multiple roles, is just super fucking hard. With hundreds of such roles, each role being a table would be insane. Representing this information in graphs is a cakewalk in comparison. And this problem gets a lot worse if you then switch to the music industry, books and others (hence, the decision to be a knowledge "graph").

Functionality-wise, graph DBs provide a superset of SQL. They support all the (equivalent of) select x from y where z type statements, while also doing fast and recursive traversals and joins at the DB level.

And recursive traversals and joins are a huge deal. The rise of GraphQL over REST APIs is in a way indicative of that. To render a page in modern websites, you need to recursively ask for components (think questions in Quora or Stack Overflow). I remember Quora would have thousands of such components on a single page. GraphQL made it easier to query for those, by expressing a way to retrieve this tree in a single query. But the internal mechanics of doing this via relational tables are still the same: a repeated query-and-collect cycle. Graph DBs natively support things like these, and imagine how much more efficient and powerful that is.
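The repeated query-and-collect cycle can be sketched in a few lines (a toy Python stand-in; the `children` dict plays the role of a parent/child table, and all names are made up):

```python
children = {  # parent -> list of child component ids, standing in for a table
    "page": ["q1", "q2"],
    "q1": ["a1"],
    "q2": ["a2", "a3"],
}

def fetch_children(node):
    """Stands in for one round-trip query against the DB."""
    return children.get(node, [])

def render_tree(node):
    # Each recursive call is another query round-trip in the relational
    # version (the classic N+1 pattern); a graph DB answers the whole
    # nested structure in one traversal.
    return {node: [render_tree(c) for c in fetch_children(node)]}

tree = render_tree("page")
print(tree)
```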

Once you start to wrap your head around graphs, it’s hard to not be wholeheartedly impressed by their power.

Disclaimer: I'm the author of dgraph.io. But don't let genetic fallacy blind you. My points above stem from the reasons which propelled me to jump into the graph DB world.

The film industry example makes sense. Partially duplicative tables make development confusing.

Do you still feel there is an advantage of graph over relational when we have a known schema and known relationships without deep recursive relationships? For example, in an inventory tracking system we have items, customers, deliveries, etc. I like the idea of being able to throw some metadata onto any of those tables quickly during prototyping, but my gut feeling is that long term we run into the need to be more structured and explicit, like we do with a relational DB. It reminds me somewhat of the tradeoffs with NoSQL DBs during development.

Not only do I think there's a lot of benefit in representing data for which the schema is already fixed, we've actually gone to the extent of showing this by building a whole replica website for Stack Overflow.

https://github.com/dgraph-io/graphoverflow (unmaintained, so please don't complain if it doesn't work :-)).

If you build systems like inventory tracking, question answering, etc., the hard logic of relevant data retrieval can lie either in your application or within your DB. The former is the case when you use relational DBs, the latter when you use graph DBs.

With a graph DB, you can put the data together quickly, but then have the DB do the heavy lifting of "given a customer, find me all the items and the locations of delivery" (just a random query I spent 2 seconds on, not representative of a real workload); or "given a question, find me all the answers, sorted by a score; the top 5 comments on these answers sorted by date, with a count of total comments, count of likes, count of dislikes, etc." (a real workload for Q&A sites). Then application iteration becomes largely a factor of query iteration, not backend-logic iteration.

^ And that's solid! That kind of stuff is what makes developers love JS over C++ (random comparison).

> We were solving this problem with Google's knowledge graph where we had to fit movie dataset in DB. The film industry has so many roles (director, producer, actor, cinematographer, and so on), that having a table for each, with many times same person doing multiple roles, is just super fucking hard. With hundreds of such roles, each role being a table would be insane. Representing this information in graphs is a cakewalk in comparison. And this problem gets a lot worse if you then switch to the music industry, books and others (hence, the decision to be a knowledge "graph").

Have you had success with modelling temporal data in a graph? E.g. "Bob worked for WB from 1999-2006, Disney 2006-2008, then WB again. In 2013 she transitioned and is now called Anna". Thinking of a property graph engine, both relations and properties would need to be versioned.

It feels like a graph database should be a good fit, better than an RDBMS for sure, but I have pages of sketches on how to model history and have come up with nothing workable. https://arxiv.org/abs/1604.08568 is the best paper I've found, but I haven't got anything working.

My interest is as an amateur archivist, as dumping a description like the above into a text field and displaying it back to people is less useful than being able to query it or show the changes over time. Especially when you want to link it with files or media for retrieval purposes.

Yeah, we were doing that with intermediate nodes. So,

Bob -worked_at-> Work node
Work node -from-> date
Work node -to-> date
Work node -employer-> WB, Disney, etc.

Then each instance of Bob working would be a node in the graph. Note that (and this might be counterintuitive) this is the same as how you'd represent marriage data as well.

These intermediate nodes are the only complexity that one has to think about in a graph model (and even that isn't very complex compared to thinking through how 20 different tables are connected). The rest is easy peasy.
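In Cypher terms (node labels and property names invented for illustration), one such employment stint would look something like:

```cypher
// Each stint is its own node, so it can carry its own dates.
CREATE (bob:Person {name: 'Bob'})
CREATE (wb:Company {name: 'WB'})
CREATE (stint:Employment {from: 1999, to: 2006})
CREATE (bob)-[:HAS_EMPLOYMENT]->(stint)-[:EMPLOYER]->(wb)
```

Repeating the `Employment` node per stint is what makes the history queryable, rather than overwriting a single `works_at` edge.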

Thanks, going to have a go at re-sketching with intermediate nodes!

Currently, using Gremlin on AWS Neptune. The learning curve was steep.

If the situation calls for it, sure! The current use case is sort of up in the air. The decision was made to use a graph database to store the mutation of records over time, but then the higher-ups want to limit what's put in it, so I'm not sure the computation costs are worth what it's actually capable of. From what I'm gathering, if one is looking to store complex data that is highly connected through many kinds of relationships (say, more than 5 types of edges), then it might be worth looking into. But I can't imagine how a dynamic/traditional table database wouldn't have been faster for what we need it for: querying large lists of data with 3-4 edge traversals.

I only have a year or so of experience with databases, so grain of salt. In terms of personal preference, working with a graph database has been quite fun.

We use Titan DB and are migrating to JanusGraph since Titan is dead. We used Neo4j for a small hack at an event at the office, mapping bus routes and the homes along the bus stops to find out quickly how accessible they were, and it worked well ... for a hack.

I really like Gremlin, and I like how easily you can extend the relations and do new computations you never thought of, but it's not the savior it's been hailed as, in my opinion. For a lot of problems SQL will serve you well. Migrating can be a bitch with SQL, but if it's a domain where the basic functionality is solved (such as a web shop) I wouldn't bother with a graph database until I find a good use case for it. You can always migrate your SQL tables to a graph DB later on if you think it's worth it.

We have a production system based on Neo4J (and Elasticsearch). We have had some hard times with Neo4J (cluster issues, deadlocks) but their support helped us figure out those issues.

Honestly, a graph database made discussions with the domain experts MUCH easier. And the schemalessness made evolutions much easier. Our technical team embraced the concept really quickly. And the domain experts now have a clean mental model of data that was otherwise split between very unfriendly technologies (XML databases, files, CSV).

We wonder whether Elasticsearch could be removed from our architecture (because managing two databases is a mess), but we do not know yet whether Neo4J can handle both the load and the variety of search use cases.

I've used TitanDB multiple times for non-production projects, and would absolutely use it again. It runs in process on the JVM, and models true graph problems well. (dependencies in large code bases, social networks)

What's your use case?

As for me, for decades I've wanted to be able to have everything stored on my computer represented as a graph. (Times have changed, so there's obviously a strong network-connected aspect now.)

Helped build one on top of cassandra at past employer - they’re still apparently happy with it. It’s something like 5-6 petabytes and handles millions of writes per second, hundreds of thousands (or maybe millions) of reads/traversals per second. Powers a successful and growing APT-hunting SaaS platform, and I’m still pretty proud of it, even though I don’t work there anymore.

Yes, with RDF+SPARQL I would use it any day of the week. The power of RDF comes when you have to deal with other people's data, or provide your data to other people. This is not a use case everyone has, but if you do, nothing beats RDF+SPARQL in a financial sense.

The variety of DBs available if you use RDF is great as well. Different DBs have different strengths, but we can keep the same data model and query language.

I made this: http://root.rupy.se

It's also based on a small HTTP server: http://github.com/tinspin/rupy

I will use it for the rest of my life in every project that needs relations.

I am finding Datalog* more general and easier. It lets you write rules declaratively rather than procedurally. I don't know how its performance compares to a traditional graph database, but it sure saves a lot of programming time.

(I'm using pyDatalog, which is open source and works with a variety of database backends.)
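To make the declarative style concrete, here is a rough plain-Python sketch of what a Datalog rule pair like `ancestor(X, Y) <= parent(X, Y)` and `ancestor(X, Y) <= parent(X, Z) & ancestor(Z, Y)` computes: a fixpoint over the facts. The facts are made up for illustration, and pyDatalog itself expresses the rules declaratively rather than with this manual loop.

```python
# Facts: (child, parent) pairs -- invented for illustration.
parent = {("bob", "alice"), ("carol", "bob"), ("dave", "carol")}

# Rules: ancestor(X, Y) <= parent(X, Y)
#        ancestor(X, Y) <= parent(X, Z) & ancestor(Z, Y)
# A Datalog engine evaluates these declaratively; here we
# compute the same fixpoint by hand to show the semantics.
ancestor = set(parent)
while True:
    derived = {(x, y)
               for (x, z) in parent
               for (z2, y) in ancestor
               if z == z2} - ancestor
    if not derived:
        break
    ancestor |= derived

print(sorted(ancestor))  # 6 pairs, including ("dave", "alice")
```

With pyDatalog you write only the rules and facts; the engine handles the fixpoint, which is where the programming-time savings come from.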

Are GraphDBs ever the right move in terms of performance?

For things that can reasonably be done with RDBMS, probably not... see "Do we need specialized graph databases? Benchmarking real-time social networking applications" (PDF link https://event.cwi.nl/grades/2017/12-Apaci.pdf)

But it seems likely that for queries that differentiate graph databases, such as finding long, variable-length paths, that there are cases where they can excel.
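For example, a variable-length path query is a one-liner in Cypher (labels and relationship types invented for illustration), where the SQL equivalent needs a recursive CTE:

```cypher
// Everyone reachable from Alice within 1 to 6 FRIEND hops.
MATCH (a:Person {name: 'Alice'})-[:FRIEND*1..6]->(b:Person)
RETURN DISTINCT b.name
```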

I found the same. If you live in a very specific use case they excel.

I work with large amounts of geographic data. We use Cassandra and an RDBMS as the traditional storage, but whenever we want to do network analysis the data goes into a graph DB just to take advantage of the tooling.

And one of our use cases is exactly what you mention. If you are interested in the properties of the edges of long highways in a road network that can stretch hundreds of edges, for instance, RDBMS ain’t gonna cut it.

thanks for that link!

I'm an engineer who did relational DBs for a long time. One day a customer of a friend came with an issue that was, in my opinion, impossible to solve with relational DBs: he described data that is in flux all the time, and there was no way we could come up with a schema that would fit his problem for more than one month after we finished it. Then I remembered that another friend had once mentioned this graph model called RDF and its query language SPARQL, and I started digging into it. It's all W3C standards, so it's very easy to read into, and there are competing implementations.

It was a wild ride. When I started there was little to no tooling, only a few SPARQL implementations, and SPARQL 1.1 was not released yet. It was a PITA to use, but it still stuck with me: I finally had an agile data model that allowed me and our customers to grow with the problem. I was quite sceptical that it would ever scale, but I didn't stop using it.

Initially one can be overwhelmed by RDF: It is a very simple data model but at the same time it's a technology stack that allows you to do a lot of crazy stuff. You can describe semantics of the data in vocabularies and ontologies, which you should share and re-use, you can traverse the graph with its query language SPARQL and you have additional layers like reasoning that can figure out hidden gems in your data and make life easier when you consume or validate it. And most recently people started integrating machine learning toolkits into the stack so you can directly train models based on your RDF knowledge graph.
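As a small taste of that traversal, a SPARQL 1.1 property-path query (using the FOAF vocabulary and an invented example URI) might look like:

```sparql
# Names of friends-of-friends of Alice, in one path expression.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?name WHERE {
  <http://example.org/alice> foaf:knows/foaf:knows ?friend .
  ?friend foaf:name ?name .
}
```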

If you want to solve a small problem RDF might not be the most logical choice at first. But then you start thinking about it again and you figure out that this is probably not the end of it. Sure, maybe you would be faster by using the latest and greatest key/value DB and hack some stuff in fancy web frameworks. But then again there is a fair chance the customer wants you to add stuff in the future and you are quite certain that at one point it will blow up because the technology could not handle it anymore.

That will not happen with RDF. You will have to invest more time at first: you will talk about things like the semantics of your customer's data, and you will spend quite some time figuring out how to create identifiers (URIs in RDF) that will still be valid years from now. You will look at existing vocabularies and refine only the things that are really necessary for the particular use case. You will think about integrating data from relational systems, Excel files, or JSON APIs by mapping them to RDF, which again is all defined in W3C standards. You will mock up some data in a text editor, written in your favourite serialization of RDF. Yes, there are many serializations available, and you should most definitely throw away any book/text that starts with RDF/XML; use Turtle or JSON-LD instead, whichever fits you best.
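Such a mocked-up Turtle file might look like this (using the well-known FOAF vocabulary plus an invented `ex:` namespace):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

ex:alice a foaf:Person ;
    foaf:name "Alice" ;
    foaf:knows ex:bob .

ex:bob a foaf:Person ;
    foaf:name "Bob" .
```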

After that you start automating everything, you write some glue-code that interprets the DSL you just built on top of RDF and appropriate vocabularies and you start to adjust everything to your customer's needs. Once you go live it will look and feel like any other solution you built before but unlike those, you can extend it easily and increase its complexity once you need it.

And at that point you realize that it was all worth it, and you will most likely not touch any other technology stack anymore. At least that's what I did.

I could go on for a long time, in fact I teach this stack in companies and gov-organizations during several days and I can only scratch the surface of what you can do with it. It does scale, I'm convinced by that by now and the tooling is getting better and better.

If you are interested start having a look at the Creative Commons course/slides we started building. There is still lots of content that should be added but I had to start somewhere: http://linked-data-training.zazuko.com/

Also have a look at Wikipedia for a list of SPARQL implementations: https://en.wikipedia.org/wiki/Comparison_of_triplestores

Would I use other graph databases? Definitely not. The great thing about RDF is that it's open, you can cross-reference data across silos/domains and profit from work others did. If I create another silo in a proprietary graph model, why would I bother?

Let me finish with a quote from Dan Brickley (Google's schema.org) and Libby Miller (BBC) in a recent book about RDF validation:

> People think RDF is a pain because it is complicated. The truth is even worse. RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated. While you can avoid RDF, it is harder to avoid complicated data and complicated computer problems.

Source: http://book.validatingrdf.com/bookHtml005.html

I could not have come up with a better conclusion.
