Ask HN: Implementing a graph database using Postgres tables for nodes and edges?
87 points by arunnanda on Oct 2, 2015 | 60 comments
After hearing/reading about the negative experiences many people have had with the newer graph databases, I am wary of jumping into a graph db. But almost any modern application has data that is well suited to a graph representation.

So I am considering a workaround: use a graph structure, but represent the underlying data in a reliable RDBMS, in particular Postgres.

The approach is: 1. Make two parent tables: node and edge. 2. Make separate tables for all "objects" (like people, places, things, etc.) that inherit from node, so they all have a unique node ID. 3. Make separate tables that inherit from edge to represent the many-to-many relations. So each relation has an edge ID, and each table inherited from edge can be thought of as representing a specific kind of relation (like person lives in place, or person is friends with person).
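Roughly, a sketch of what I mean (table and column names are just placeholders):

    -- Two parent tables, then concrete tables that inherit from them.
    CREATE TABLE node (
        node_id bigserial PRIMARY KEY
    );

    CREATE TABLE edge (
        edge_id bigserial PRIMARY KEY
    );

    -- "Object" tables inherit from node, so every row draws a unique
    -- node_id from the same sequence.
    CREATE TABLE person (name text NOT NULL) INHERITS (node);
    CREATE TABLE place  (name text NOT NULL) INHERITS (node);

    -- Each kind of relation gets its own table inherited from edge.
    CREATE TABLE person_lives_in_place (
        person_id bigint NOT NULL,
        place_id  bigint NOT NULL
    ) INHERITS (edge);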

One thing I have observed so far is a large number of tables with few columns, but I think that lends itself to easy indexing. There can be a large number of individual queries from the front end, but I believe I can use views to present tables with more comprehensive info and reduce the number of queries.

What do you guys think of an approach like this? What am I missing, what is wrong with it? I haven't come across this previously, and so am a bit nervous about the ramifications. Is someone else also doing this?




Storing the data isn't an issue. It's fairly trivial to come up with a number of very good solutions to describing a graph.

The issue is querying the data, specifically when that involves walking the graph.

The real question then, is how do you want to query the data? If you are willing to make an assumption now that adds a tight constraint on your future work... i.e. "We will only ever perform queries that are at most 1 or 2 hops from the nodes/edges being queried" then perhaps PostgreSQL will work for you.


You can do recursive graph queries with PostgreSQL: http://www.postgresql.org/docs/9.5/static/queries-with.html
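For example, a bounded friend-of-friend traversal can be written as a recursive CTE (sketch, assuming a hypothetical friendship(person_id, friend_id) table):

    -- Everyone reachable within 3 hops of person 42.
    WITH RECURSIVE reachable(person_id, depth) AS (
        SELECT f.friend_id, 1
        FROM friendship f
        WHERE f.person_id = 42            -- starting node
      UNION
        SELECT f.friend_id, r.depth + 1
        FROM friendship f
        JOIN reachable r ON f.person_id = r.person_id
        WHERE r.depth < 3                 -- cap the traversal depth
    )
    SELECT DISTINCT person_id FROM reachable;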


Agreed.

For shallow or narrow queries of the graph this is the way to go.

But for deep and broad queries, if you want to perform some valuable analysis that involves traversing the graph some non-trivial distance from the point of origin, that's where PostgreSQL is going to let you down and you would probably be better off picking a more appropriate tool.

Same question though: How do you want to query your graph, and are you willing to limit yourself to not querying it in a certain way?


I'm pretty sure you can get comparable performance from a graph in SQL if you make sure you have covering indexes for the Node ids and the edges/adjacency list table.

This may be implementation-dependent but in theory, it's solid. I'm not sure what else a graph database would be able to do to beat an indexed adjacency list.
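For instance, with a plain edges(from_id, to_id) adjacency table (hypothetical names), something like:

    CREATE TABLE edges (
        from_id bigint NOT NULL,
        to_id   bigint NOT NULL,
        PRIMARY KEY (from_id, to_id)       -- covers outgoing hops
    );

    -- Reverse index so incoming hops can also be answered from the index alone.
    CREATE INDEX edges_to_from_idx ON edges (to_id, from_id);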


You raise a good point. I don't expect graph queries that go more than a few links away. To the extent that there will be such queries, I think appropriate indexing will help a great deal.

Implementation-wise, it is not that different from a query with many joins anyway - which is a standard problem, and is addressed by materialized views as well as indexing. So as long as the complex queries don't need to be ad hoc and real time, I can prepare for them using materialized views.
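For example, a friends-of-friends query could be prepared ahead of time like this (a sketch, assuming a hypothetical friendship table):

    CREATE MATERIALIZED VIEW friend_of_friend AS
    SELECT f1.person_id, f2.friend_id AS fof_id
    FROM   friendship f1
    JOIN   friendship f2 ON f2.person_id = f1.friend_id
    WHERE  f2.friend_id <> f1.person_id;

    CREATE INDEX ON friend_of_friend (person_id);

    -- Rebuild when the underlying data has changed enough to matter.
    REFRESH MATERIALIZED VIEW friend_of_friend;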

However, you are quite right that complex, ad hoc, realtime graph queries will be a pain point. That is a risk I am prepared to tolerate in the short term. The thing is, I also need transactional integrity, so I am stuck with an RDBMS as the primary solution. In the long term, I anticipate the need for both graph queries and transactions - thus, I think it is better to start with Postgres and incorporate a graph db as the need arises than the other way round.

Hopefully this line of reasoning makes sense? I am really not experienced at this, so I am trying not to bump into too many walls.


The boring answer is that it depends on your data and what you need to do with it. Insights like "all data structures are graphs!" are dangerous - they can add vagueness that prevents a more useful understanding of your problem. Excessively generic handling of data often ends in pain when it comes to the real world (performance and complexity) compared to domain-specific approaches.

If you have multi-GB graphs and need to run intensive classic graph analysis algorithms, then a lot of that comes optimised out of the box with a dedicated graph db. They will still leave you with plenty of hard problems.

However, if you have small graphs or only need conventional crud operations then using a new and unfamiliar graph db would probably be insane.

If your case is the latter, Postgres is your comfort zone, and you can get a working solution quickly, I would be tempted to prototype along the lines you suggest. When you start getting performance problems or need graph analysis, you can then make a better decision about whether a graph db or a restructured relational db will solve it best.


Thanks, that was almost the same reasoning I used to settle on this approach.

I definitely need ACID transactions, and also some graph capability - which may or may not grow in the future. More pertinently, the kind of representation I was attempting led to a very normalized design anyway, and I think it also lends itself to all but the most sophisticated of graph functionalities! But I am only just starting, so hopefully it works out well.


The PostgreSQL Conference 2011: "Trees and Graphs in the database" https://www.pgcon.org/2011/schedule/events/357.en.html

* Trees in the database - Slides: http://www.slideshare.net/quipo/trees-in-the-database-advanc...

* Graphs in the database - Slides: http://www.slideshare.net/quipo/rdbms-in-the-social-networks...


Thanks for the links. Glad to know there are more experienced people pursuing this seriously!


I am a huge fan of the Tinkerpop Stack (http://tinkerpop.incubator.apache.org) -- they have written a common set of interfaces for graph databases. There is an implementation that can use Postgres as its storage engine (https://github.com/pietermartin/sqlg), there is one that can use Elasticsearch, and I've even seen one that uses MongoDB. Check it out.


This made my day! Thanks, I will check it out!


This sounds broadly sensible. It's basically a classic normalised relational design, which is as perfect as alligators or sewing needles, but with extracted supertables for nodes and edges.

I would challenge the idea that you need these supertables, though. What those allow you to do is to write graph queries which are generic (polymorphic?) over multiple kinds of node and edge. So as well as "find me all the people who are friends with my friends", you can write "find me all the people who are friends with my friends, or who have been to a place i've been, or who own a thing i own" without using a union. Do you actually need to write queries like that?
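For concreteness, such a generic query might look something like this (just a sketch, assuming the edge supertable carries generic from_node/to_node columns referencing node):

    -- "Anyone who shares any neighbour with me, whatever the edge type."
    SELECT DISTINCT e2.from_node
    FROM   edge e1
    JOIN   edge e2 ON e2.to_node = e1.to_node
    WHERE  e1.from_node = 42        -- my node id
    AND    e2.from_node <> 42;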

There are two downsides to the supertables. Firstly, more complexity, although it's a minor, or at least constant-factor, amount. Secondly, a loss of type safety. If your edges are defined in a supertable, then the columns which point to the ends of the edge have to be foreign keys to the node supertable. That means they can be any type of node; there's no way to constrain particular kinds of edge to connecting particular kinds of node. That seems like a considerable drawback to me.


True, as I was designing the tables, I was surprised to notice how normalized it looked.

The generic kind of queries you mention: actually, yes, I think I will need them.

Your second point on downsides is something I have/had to think about, thanks for raising it. I had some vague premonitions along those lines, but you helped make it concrete. The problem can be avoided by not making meaningless connections, but that's not a real solution. It won't be preventive, but it could help to audit relationships - by having a script list out which tables the FKs originate from.

Since I started writing this post, I have had two ideas on how to address this, it'd be great to have your opinion on them - I'm thinking out loud here.

1. Use referential integrity. Since each relationship/edge will be in its own inherited table, one could impose a constraint that each column_in_the_specific_edge_table REFERENCES another_column_in_a_specific_node_table. Like the columns in the person_lives_in_place relationship table must reference columns in the person table and the place table - each of which is also its own inherited node table.

2. More convoluted and a bit cumbersome, but if the previous/simpler approach is insufficient, one could create data types corresponding to each node type, and impose those as type constraints on the edge tables.

But maybe the 1st option could actually work...what do you think?
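To make option 1 concrete, something like this (names illustrative):

    -- Edge columns reference the specific node tables, not the node supertable.
    -- Note: person(node_id) and place(node_id) each need their own UNIQUE or
    -- PRIMARY KEY constraint, since such constraints are not inherited.
    CREATE TABLE person_lives_in_place (
        person_id bigint NOT NULL REFERENCES person (node_id),
        place_id  bigint NOT NULL REFERENCES place (node_id),
        PRIMARY KEY (person_id, place_id)
    ) INHERITS (edge);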


Why re-invent the wheel? All you need to do is properly understand your tool of choice. I have used Neo4j extensively for years and have had great results. Our biggest hurdle was understanding how to model the data for a graph properly. Get this right and everything falls into place. The one thing you are going to need is a good visualisation tool, so be prepared to roll one of these too :)


I was also wondering about the same thing:

Implementing a graph on top of an RDBMS is trivial, and if the semantics are correct (the same as the ones exposed by a graph db?), then I'm not sure why people would want to use a proper graph db.

I thought it would probably be an issue of performance: the "right tool for the job" that "does one thing and does it well" probably is leaner and more efficient. After all, unlike a trivial implementation, getting a graph on an RDBMS to perform well might not be that simple (still, your idea of inheriting tables might make things more flexible and maybe more efficient).

But then, when looking up some Neo4j benchmarks, the numbers don't seem good at all:

http://baach.de/Members/jhb/neo4j-performance-compared-to-my...

I'd like to hear from someone who used Neo4j (other graph databases are interesting too) in production, benchmarked it against an RDBMS prototype, and found the former to be the better solution of the two.


AFAIK:

A proper graph db is almost mandatory if there are complex, ad hoc queries that need to be made on real time data.

Using an RDBMS is great if the graph query types are going to be known in advance - so they can be prepared for using materialized views and indexes, and if they aren't too complex - so one doesn't descend into JOIN-hell.

But most applications aren't like that, and not all applications can be completely satisfied using only a graphdb. Hence the rise of the new multimodel databases like OrientDB and ArangoDB. So I think it is a question of what risk one is prepared to tolerate.


While the author admits that the documentation advises using the traversal API (which gives the best performance), he goes with Cypher over the REST interface, which would never match a direct DB connection. It's a fundamental flaw in his comparison.


I noticed that, but he's doing a single REST query for every timing:

https://github.com/jhb/neo4j-experiements/blob/master/query_...

There's no way that a single HTTP request is responsible for a difference of hundreds of seconds.

(His benchmark code isn't really sound, but I don't see how that could account for such a striking difference.)


I've done this a few times. First, see the list of PostgreSQL-based extensions and tools listed by other folks below; they're all good things to look at before you implement something from scratch. Regardless, this is probably a good approach if you are using Postgres anyway and just need a graph for part of your application.

For short graph traversals, recursive queries and joins perform surprisingly well, even when compared against dedicated graph databases -- at least, the open source ones I've been able to try, including neo4j. The ones I've tried perform better than Postgres mostly when you get more than two hops away -- but then at least Neo and the MySQL thing tend to suffer if the data set doesn't fit in memory.

I had at least one project which reverted from a dedicated graph database to Postgres specifically because it needed decent on-disk performance (2TB per shard).

Anyway, to your question: you could build a PostgreSQL database like that, but why would you? If you have distinct types of entities (person vs. furniture vs. town), why would you denormalize them into generic node inheritance? Having a completely abstracted database with one big table called "things" and another big table called "relationships" seems really attractive before you actually do it. Then it starts to suck.

Relational databases are designed to perform well, and to be easy to query, with "normalized" databases in which different types of entities are represented by different tables and relationships are defined by foreign keys. The further you get away from that model, the harder the database gets to use.

This is one of the problems with using graph concepts for questions which are not, fundamentally, graphs. Yes, you can define people's nationality as a graph from location to nation, but that makes it pretty hard to do voter registration stats. Graphs are a great tool for answering many important questions, but they are not a great tool for answering all questions.

Now, if you are answering an actual graphing question (i.e. show me everyone who shares something in common with me via at least two links within 3 hops), then proceed. But figure out what kinds of questions you specifically want to answer before you get too far.


I noticed the same "explosion of tables" when I was doing the actual design. But, almost accidentally, it also turned out that the tables were almost all either 5NF or 6NF. So I thought this is just a highly normalized setup anyway - something traditional databases are supposed to be designed for.

The list of Postgres-based extensions mentioned in this thread turned out quite relevant actually! I'm still going through them all, seeing if I can reuse something...

Many thanks for the inputs, quite helpful.


Some alternatives to check:

* RDFLib Store for PostgreSQL ( https://github.com/RDFLib/rdflib-sqlalchemy )

* Sparqlify is a SPARQL-SQL rewriter that enables one to define RDF views on relational databases and query them with SPARQL. ( PostgreSQL supported http://aksw.org/Projects/Sparqlify.html )

* PostgreSQL-Neo4j Foreign Data Wrappers https://github.com/nuko-yokohama/neo4j_fdw

* PgRouting (network ) http://docs.pgrouting.org/dev/doc/src/tutorial/analytics.htm...


Thank you so much for this list. I am looking into these... Maybe there's something I can reuse, but at the very least I can get some inspiration from checking them out.


I think you want to google "triple store" because that is effectively what you are trying to do AIUI. Triple stores are the persistence strategy of choice for RDF and other Semantic-Web-type applications.

Each row in a triple store stores a relationship between 2 objects, with unique IDs for the subject, the relationship type, and the object. The IDs could refer to other tables that store entity and relationship details.

There are specialised triple store DBMSs, but implementing one in PostgreSQL should be pretty easy.
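A bare-bones version could be as simple as this (illustrative sketch):

    -- One row per (subject, predicate, object) triple.
    CREATE TABLE triple (
        subject_id   bigint NOT NULL,
        predicate_id bigint NOT NULL,
        object_id    bigint NOT NULL,
        PRIMARY KEY (subject_id, predicate_id, object_id)
    );

    -- Typical access paths: "all facts about X" and "who points at Y".
    CREATE INDEX ON triple (predicate_id, object_id);
    CREATE INDEX ON triple (object_id);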


No-one seems to have yet mentioned Facebook's TAO paper, which describes scaling basically this approach to 1 billion reads per second. https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/re...

The nice thing about your approach is that you can shard your objects by their id and your edges by their "from" id, and have all lookups go to one box when you shard. Throw in some caching and it scales to multiple boxes really well.


You said it!

Actually, Facebook's TAO was an early inspiration for me as well, but it is really customized and complex too, so probably not something for a tiny team. But the concepts from TAO are applicable to many similar scenarios.


If directed acyclic is sufficient for your use case, you can use this approach, which has worked perfectly for me in the past: http://www.codeproject.com/Articles/22824/A-Model-to-Represe...


This is a really interesting approach, thanks! I have a relational-to-graph problem that can be done as a DAG, and I was planning on using Postgres as a source of truth and mirroring out the data to Neo4J just for queries, but I think I might try this instead.


I use Neo4J and I'm curious what negative experiences you are referring to, and what benefit this might have over that. The biggest problem I foresee is querying and possibly data integrity. Querying, since doing any kind of graph traversal would probably require writing hacks in software, and data integrity for much the same reason. It may be easy to ensure a basic relationship, but beyond that I would think it would be more difficult.


I was never able to use Neo4J as a first level database on a busy website. Simple traversals were just too slow. I ended up maintaining a simple in-memory structure on a pair of machines behind a load balancer with all writes going to sql and then propagated to the graph instances. I realise this introduces issues around read-your-writes etc, but good enough for my use case.


It's extremely memory-hungry, and you need huge machines.


I've done this a few times. Depending on the size/density of your data, it can be fine. If you have billions of entities and/or highly dense relations, it's not very efficient though.

I didn't use table inheritance, and I don't distinguish relationships at the query stage, so if that's core to your method you should think about it. I assume with that setup it'd be easier to take advantage of partial indexes though.

Recursive queries work, but if the intermediate table gets large it explodes fast. You can't use aggregates in the recursive part of the query either (e.g. to count the number of leaf nodes in a tree); you have to apply them at the end, in which case the intermediate table has to be large...

I've had decent experience so far with GIN indexes and @> operators on an ARRAY[] column adjacency list. In my case I stored "ancestors" so I could select by object id and get all ancestors, or use `ancestors @> ARRAY[parent_id]` to select all object ids as descendants.
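Roughly like this (illustrative names):

    CREATE TABLE object (
        id        bigint PRIMARY KEY,
        parent_id bigint,
        ancestors bigint[] NOT NULL DEFAULT '{}'   -- full ancestor chain
    );

    CREATE INDEX object_ancestors_gin ON object USING gin (ancestors);

    -- All descendants of node 42, answered via the GIN index:
    SELECT id FROM object WHERE ancestors @> ARRAY[42::bigint];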

Of course, if you don't need any "self" relationships, then none of the really complicated stuff matters...


Thanks for the feedback.

Indeed, one of the reasons behind inheriting tables was to use partial indexes - which should help with performance. Another was to ease scaling out, if needed.

Using the @> operators on array columns is something I also looked into, mainly for materialized paths - I expect them to remain static, so no expensive updates to the paths/arrays. But actually, I don't think I will need many self relationships - that was the third reason for using inherited tables, so I could split entities into separate groups.

Of course, I have no idea how any of it will work out - this started as just a semi-serious line of inquiry, and everything was conceptual so far. But the feedback of experienced people such as yourself on this thread has led me to pursue this seriously and actually try to build it out. After reading the responses here, I now think it will be worth the effort.


I had a similar challenge and ended up using Elasticsearch with a thoroughly denormalized data format. It allowed me the flexibility of stateful (contextual) queries etc. Bit of a hack, but it served my purpose. Hope that helps.


Yes, it might be a hack [actually I don't think it is], but it should offer some powerful functionalities...


Yes, it might be a hack, but it should offer some powerful functionalities... I'll remember this one.


A lot of the early academic RDF graph databases used this approach. However, performance was nowhere near what is required. This led the SPARQL/RDF graph database world to a whole set of independently developed stores: some on top of existing solutions, e.g. Oracle semnet, Virtuoso and DB2 SPARQL, others on their own solid foundations.

You could test out 3 or 4 different SPARQL solutions in the time it would take you to develop something graph-like on your own.

On the other hand, cutting-edge approaches actually take a graph representation of data and lay it out in a relational manner (http://ercim-news.ercim.eu/en96/special/monetdb-rdf-discover...), giving the best of both worlds.

In short, you can build something yourself. But don't expect that it will be better than something built and supported by someone else.

So investigate the competition - BlazeGraph, Virtuoso, StarDog, Oracle 12c EE/semnet, DB2, Dydra - before deciding to build your own. Building your own because it's fun to do is great, but unless it pays your bills it's not a good idea for a production environment.

PS. The edge table (EAV) is the major problem; it leads to a lot of self-joins and difficult exercises for the query planner.

You can improve a lot on this if you can put "different" edges into different tables or partitions.

Triple stores with support for JSON-LD framing such as Dydra can also make it easier to have a front-end on top of your DB without extensive middle layer code.

A store like StarDog or BlazeGraph, on the other hand, gives you a lot of flexibility by supporting both SPARQL and TinkerPop. (Both cluster and scale out, although BlazeGraph has a GPL option; StarDog is commercial only.)


Edge is not stored as EAV; the table hierarchy is:

> Node <- Entities

> Edge <- M2M

which is strange, since an edge is already an M2M.

Thanks for the monetdb link.


Ah ok, so in RDF terms each predicate gets its own table, which makes sense. Then you have a subject foreign key relationship and an object foreign key relationship to the node tables.

That would perform better than one big table, which was a major problem in the RDF-on-SQL databases.

Of course, those accepted any graph; if one constrains the number of possible predicates/relations, then this solution could be more efficient.



I am a fan of this API's base approach.

It is trivial to implement these APIs on top of simple SQL. I did that in a few days in both Python and Go.


One technique you might look into is "Materialized Path":

* usually one single query to fetch an entire tree or subtree

* more work on reconstructing tree from query result

* requires prefix pattern matching operator, e.g. "like" from RDBMS
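A small sketch of the idea (illustrative schema; text_pattern_ops lets a plain btree index serve the LIKE prefix match):

    CREATE TABLE category (
        id   bigint PRIMARY KEY,
        path text NOT NULL              -- e.g. '1/4/9/' for root 1 -> 4 -> 9
    );

    CREATE INDEX category_path_idx ON category (path text_pattern_ops);

    -- Node 4 and its entire subtree in one query:
    SELECT * FROM category WHERE path LIKE '1/4/%';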


Thanks! That's a useful technique..


Dropbox does something similar called Edgestore which is built on top of MySQL: https://www.youtube.com/watch?v=VZ-zJEWi-Vo


Facebook as well, with TAO: https://www.facebook.com/notes/facebook-engineering/tao-the-... I think they presented it at one of the @Scale conferences.


Building a graph database is time-consuming. Why are you putting edge's children (person, place, friend) in separate tables instead of using a JSON field? Do you plan to query the person, place, and friend tables directly? You need to benchmark the different queries.

My understanding is that the big five and others do not use a graph db for all their stuff, since graph dbs are not as good as an RDBMS at queries with a single hop or JOIN. They embrace microservices and maintain a graph db (in-memory, persistent or distributed) to answer domain-specific queries. That approach is similar to your schema, except that graph queries run on a single node without superfluous network roundtrips.

It would be nice to be able to use a single database for all data-related stuff, to get atomic writes and a simpler architecture. That's what multi-model databases are tackling - have a look at OrientDB and ArangoDB [1].

Also, Tinkerpop, already mentioned in the thread, is a ready-to-scale graph db with much love. I recommend you have a look at Tales of the Tinkerpop [2].

[1] https://news.ycombinator.com/item?id=10180185

[2] https://news.ycombinator.com/item?id=10316140


Of course, JSON is a good way to store ad hoc or new relations which haven't yet been structured. But for frequently occurring relations, I prefer separate tables instead of JSON for ease of querying and joins. If the number of relationships is large, querying within a long JSON field is not the best approach.

Multi-model DBs are a good idea indeed. They were the first things I considered, and extensively so - both OrientDB and ArangoDB. I came back to traditional solutions simply for practical reasons of maturity and ease of finding professional support.

Someone here mentioned a Tinkerpop implementation that uses Postgres as the storage engine. I think that might offer the best of both worlds, but I have yet to look into it.


The point about microservices is a really valid one. In principle, if the application is neatly split into a group of microservices, it would make things easier. This is something I am going to look into once the application has evolved somewhat - it is hard to make the split from the get-go, without knowing how it is actually going to be used by the end user.


I've done this before with MySQL in 2011 after throwing my hands in the air with two different graph DBs.

By far the most productive thing I did. People will say that you need recursive queries, etc., but I found that most of the time just keeping the query set in the application layer was easy enough and super fast. Scaling it was easier too, because there are lots of articles on scaling MySQL / Postgres.


Agreed, my reasons for coming back to Postgres are quite similar. In general, one only needs to go a few layers in the graph. Recursive queries are needed for specific use cases. Even then, querying from the app, like you said, and/or building materialized views and materialized paths are very good ways of working around this, as long as the query is anticipated in advance and the view/path built before the query. So that reduces the real use of graph db's to an even smaller subset - complex ad hoc graph queries.


I'd do the same. I think this book is still relevant: http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic...


If you want to help hack on this, I've written a Postgres backend for the graph database Cayley that's fairly fast and about to merge (for use on a prod feature): https://github.com/google/cayley/pull/289

For traversals, it tries its level best to let Postgres create the appropriate JOINs, which Postgres does well... most of the time anyway. In doing this, I learned really fast how many weird semantic corners bog-standard SQL has, and about the optimization choices Postgres makes based on the way you formulate the query.

Ping me on that PR! And/or join #cayley on Freenode


Thanks for sharing this! I would love to help with this - once I have gotten up to speed on various things by building a graph on Postgres myself.


What are you actually trying to do? If you just need some queries on a small amount of read-only data, objects in memory (in a dedicated process, perhaps) are a perfectly sane approach, and much faster than ping-ponging queries with any database.


I believe Redis and HyperDex have constructs such as bit arrays and sets that can be easily used for representing and querying graphs/hypergraphs. For Redis there are some success stories and even ready-to-use extensions.


I don't know much about this but Joe Celko has a whole book on representing trees in SQL: http://www.amazon.co.uk/Hierarchies-Smarties-Kaufmann-Manage...

The obvious first question would be: why are you representing your data as a graph? Can you represent it better as sets of predicates (ie: relations)?

Using views should not reduce the number of queries: a view is just a query with a name. If you can do it by combining views you can do it by combining queries.
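For instance (hypothetical tables):

    -- A view is just a named query; selecting from it runs the same joins
    -- you would otherwise have written inline.
    CREATE VIEW person_places AS
    SELECT p.node_id AS person_id, pl.name AS place_name
    FROM   person p
    JOIN   person_lives_in_place l ON l.person_id = p.node_id
    JOIN   place pl ON pl.node_id = l.place_id;

    SELECT place_name FROM person_places WHERE person_id = 42;  -- same work as the raw join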


I would love to see a SQLite extension to support graphs...


This is fairly timely, as I'm investigating possibilities for our data model and graph dbs just popped up on my radar. Are graph dbs good for modeling multiple types of products? Let's say v1 of a product has attributes a, b, c but v2 adds x, y, z and removes a. Would those be 2 separate nodes?

I too am interested in understanding the downsides of a graph db (I'm seeing some refs to performance but nothing concrete).


Graph dbs are a very good idea. It's just that the solutions on the market are not very mature yet, and the community is not very large yet. Most of the downsides stem from these two facts.

So unless the solution you're building has a specific need for graph queries - in particular complex, realtime, ad hoc graph queries - a traditional solution is a safer bet.

You can make v1 and v2 two separate nodes if you need information on both of them. In this case you just link each version to its attributes. If you want the "latest" marked separately, you can have three nodes - one for "current", and the other two for the individual iterations. So initially "current" would link to a, b, c, and later link to x, y, z and drop the link to a. Or you can have two nodes - the current and previous iterations - and store the version number of the latest iteration as a field within the "current" node.


Do you have a gist that implements this approach?


I met with Kyle Kingsbury (Aphyr of "Call Me Maybe", the Jepsen database testing series) out in Berlin, and I recall him mentioning at one point that Postgres is a good generic database, especially with the JSON blob storage. So I am wondering if making the extensions you propose might not be worth it, because it will come with all that extra overhead.

I do, however, recommend exploring and even building your own graph database; it is a great learning exercise. But don't fool yourself by doing it on top of Postgres - actually jump in and become immersed. I built http://gunDB.io/ to scratch a similar itch: I wanted an easy graph database with realtime updates like Firebase. I would love to hear your feedback, or even have you contribute.



