RecallGraph – an open-source graph database, for version controlled graph data (github.com/recallgraph)
101 points by adityamukho on June 9, 2020 | 25 comments

I had recently been thinking about something like this. A project I have been working on has lots of relational data in Postgres and keeps the history of all the data and relationships, but it's so relational that a graph is something we were considering for the next version. But the history is important. So this might actually fit the bill.

Has anyone had any experience with this yet?

I don't have experience with either RecallGraph or ArangoDB but I have been following the project for a few months now - kudos to the team for releasing 1.0!

Naturally there are a few other open source technologies exploring the "temporal graph" space, including TerminusDB [0], which uses a git-like branching history model for data collaboration, and Crux [1], which supports bitemporal history for transactional workloads (full disclosure: I work on Crux directly). Datahike [2] is also worth checking out if you've not seen it. Both Crux and Datahike are heavily inspired by Datomic.

However, if the history-aspects of your data model end up being more complicated than the graph-aspects of your data model, then you may well be better off using SQL temporal tables as implemented by Teradata / SAP HANA / DB2 etc.

[0] https://terminusdb.com/

[1] https://opencrux.com/

[2] https://github.com/replikativ/datahike

Like you, I don't have any experience with RecallGraph, but I have been following Arango closely (I work with TerminusDB). First impression is that it looks very interesting and going in a similar 'like Git but for data' direction as Terminus. I'd be somewhat concerned about building on Arango - without your own storage layer, some of the engineering challenges will be very tricky.

When we first built Terminus, we used Postgres as a store, but found that it was too slow for the types of queries we wanted to run. After an HDT detour, we built our own in the end (https://terminusdb.com/blog/2020/04/14/terminusdb-a-technica...)

I think graph dbs (crux, terminus or recall) are very natural places for revision control.

Of these, Crux looks quite interesting. How well can it scale? Can it handle billions of documents?

For us, history is mostly about being able to audit a record and understand its history, and occasionally undo a mistake. We don't need or want to go with things like SAP.

Crux is designed to scale directly based on how RocksDB (or LMDB) performs on a single node, in terms of: sustained ingestion throughput, KV seeks/sec, and the sheer quantity of KV data that can be supported on an array of local SSDs (i.e. easily many billions of small docs). At a higher level this means that point-in-time queries will maintain good performance regardless of how much history is stored, thanks to Z-order indexing [0], and the query algorithm only requires very modest amounts of memory because the KV indexes are lazily streamed out of Rocks and processed tuple-by-tuple (though having more memory is always going to speed things up!).

Beyond the scope of a single Crux node, horizontal read scaling comes for free due to the transaction time model of history (i.e. you can spin up N identical nodes to service all manner of wholly unrelated use-cases with consistent reads).

[0] https://en.wikipedia.org/wiki/Z-order_curve
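For anyone unfamiliar with Z-order curves, here is a toy sketch of the general idea: two coordinates are bit-interleaved into one sortable key, so points close in both dimensions tend to land near each other in a single ordered KV index. This is only an illustration of the technique from the Wikipedia link, not Crux's actual index layout; the function name and the sample (valid time, transaction time) pairs are made up.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


# Hypothetical (valid_time, tx_time) pairs, small integers for readability.
points = [(3, 5), (0, 0), (1, 1), (2, 2), (7, 0)]

# Sorting by the interleaved key gives the order a sorted KV store would see.
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered)  # [(0, 0), (1, 1), (2, 2), (7, 0), (3, 5)]
```

The point of the single key is that a range scan over it can cover a rectangle in both time dimensions without a separate index per dimension.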

There are those who've started integrating this product into their stack. You can find them at the project's Gitter community forum at https://gitter.im/RecallGraph/community

Note that graph-like data can be represented quite easily in relational DB's including postgres, and general graph queries are made possible via the recursive-CTE feature in newer SQL standards.
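As a rough sketch of what that looks like in practice, the snippet below stores a graph as a plain edge table and walks it with a recursive CTE. It uses Python's built-in sqlite3 module rather than Postgres purely so it runs without a server; the `edges` table, its columns and the `reachable` helper are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES
        ('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'e'), ('x', 'y');
""")


def reachable(conn, start):
    """Return every node reachable from `start` by following edges forward."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?                      -- anchor: the starting node
            UNION                         -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT e.dst
            FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach WHERE node <> ? ORDER BY node
        """,
        (start, start),
    ).fetchall()
    return [row[0] for row in rows]


print(reachable(conn, "a"))  # ['b', 'c', 'd', 'e']
```

The same `WITH RECURSIVE` shape should carry over to Postgres, though the parameter placeholder syntax depends on the driver.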

If the graph is not the important part but you want a serverless, stateful DB with time-based history, check fauna.com

Maybe check out Cayley, Google's graph database which can use postgres as a backend?

Seems useful to note that this is built on ArangoDB, which is an open source multi-model NoSQL database with document, key-value and graph models, plus fast full-text search. You can use the same language for all of them and combine query styles (graph + document in the same query, etc). I'm not affiliated in any way but ArangoDB fulfills requirements of mine and is awesome when Postgres isn't the right fit, so I'm hoping it grows and thrives.


What are the most common Graph database use cases? When should I go with a Graph DB instead of a NoSQL or a SQL database?

Well, first off: A graph database is typically considered a type of NoSQL database. Second off, a lot of graph databases use a SQL database such as PostgreSQL as the storage engine.

What really distinguishes graph databases is the querying language. There are a lot of these out there - RDF, Datalog, Cypher and Gremlin. These are typically optimized for modeling and making it easy to query against data with a high degree of interconnectedness. So, taking an RDBMS as the baseline, and assuming that by NoSQL you meant something like a column or document store that offers poorer support for ad-hoc queries than an SQL database, a graph database would be moving in the opposite direction.

Sort of. There's technically not anything a graph database can do that can't be expressed in modern (i.e., since the early 2000s for most, or 2018 if MySQL is your jam) SQL. But sometimes it can take a fair bit of effort to do so. If you find yourself frequently getting lost in a quagmire of complex joins and recursive CTEs, a graph DB can be a real boon for the maintainability of your data layer.

I'm not so sure that many graph databases use a relational database as the data store. Some use Linear Algebra representations of the graph. Some use key-value stores. Some are proprietary implementations that we'll never know exactly how the data is represented under the covers.

RecallGraph at first glance looks a bit like TerminusDB, which was recently featured on HN [0]. In TerminusDB data is stored like code in git, and you can time travel and do branch, merge, squash, rollback, diff, blame, etc. But TerminusDB is a semantic graph database based on OWL schemas, which stores data as RDF and returns query results as JSON-LD. I will certainly give RecallGraph a closer look.

[0] https://news.ycombinator.com/item?id=22867767

The only GraphBLAS (linear algebra) graph database is RedisGraph, right?

A graph in the computer science world is a type of data model. There are many problems that are easier to solve with that kind of data model. For example:

* What is the shortest path between two locations on a map? (i.e., every time you ask Google for directions)

* What is the single point of failure in this network?

* How am I connected to pbg on LinkedIn?

* How are these financial crimes connected?

* Who is the biggest "influencer" in my Facebook network?

* How do diseases spread? How do forest fires spread?

* I want to make a phone call. How does it get routed (with old telephone switch technology) across the country?

* I have 10 rooms, 30 speakers, and 1000 attendees in my conference. How do I arrange the speakers and conference rooms for an optimal conference schedule?

* I have a bunch of pilots who speak different languages and are qualified to fly a variety of aircraft. How do I maximize the number of planes in the air at any one time?

* How do I send my garbage trucks out to collect the garbage and use the least amount of fuel?
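To make one of these concrete, here is a minimal sketch of the "how am I connected to pbg" question as a breadth-first search over an adjacency list. The names in `connections` are invented, and fewest hops is assumed to be what "shortest" means here (no edge weights).

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the fewest-hops path as a list, or None."""
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical "who knows whom" data, stored as directed adjacency lists.
connections = {
    "me": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["carol", "dave"],
    "carol": ["pbg"],
    "dave": [],
}

print(shortest_path(connections, "me", "pbg"))  # ['me', 'alice', 'carol', 'pbg']
```

Weighted variants (road distances, fuel cost) would need something like Dijkstra's algorithm instead of plain BFS.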

I get graphs and perhaps some graph algos (maximal flow etc) but I've never used a graph DB. Is this really how it works? Cos what you're describing sounds more like some kind of generic-optimiser-in-a-box.

I'd say it's more about query language, and what types of queries the DB is optimized for. Graph DBs come with query languages that let you directly ask questions like "starting from node $foo, select all nodes and edges that lead to node $bar, but only for paths consisting of edges with property $xyz > 42".

An example from Neo4J documentation:

  MATCH p =(charlie:Person)-[* { blocked:false }]-(martin:Person)
  WHERE charlie.name = 'Charlie Sheen' AND martin.name = 'Martin Sheen'
  RETURN p

which, per documentation, "returns the paths between 'Charlie Sheen' and 'Martin Sheen' where all relationships have the blocked property set to false".


Graph DBs are designed for modelling your data as nodes with properties, connected by directed edges (which also have properties) to other nodes; they're also internally optimized for doing such queries.
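For comparison, here is roughly what the Cypher query above asks for, written by hand in Python over an in-memory edge list. It is a sketch only: the edge data and the `all_paths` helper are invented, edges are treated as undirected to mirror the direction-less pattern in the query, and each node is visited at most once per path, which is a simplification of how a real graph DB matches variable-length paths.

```python
def all_paths(edges, start, goal, edge_ok):
    """Yield every simple path from start to goal using only edges passing edge_ok."""
    adjacency = {}
    for a, b, props in edges:
        if edge_ok(props):
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)  # undirected traversal

    def walk(node, path):
        if node == goal:
            yield list(path)
            return
        for neighbour in adjacency.get(node, []):
            if neighbour not in path:  # never revisit a node within one path
                path.append(neighbour)
                yield from walk(neighbour, path)
                path.pop()

    yield from walk(start, [start])


# Hypothetical relationships; only the two person names come from the query.
edges = [
    ("Charlie Sheen", "Film A", {"blocked": False}),
    ("Martin Sheen", "Film A", {"blocked": False}),
    ("Charlie Sheen", "Martin Sheen", {"blocked": True}),
    ("Charlie Sheen", "Friend B", {"blocked": False}),
    ("Friend B", "Martin Sheen", {"blocked": False}),
]

paths = list(
    all_paths(edges, "Charlie Sheen", "Martin Sheen",
              lambda props: props["blocked"] is False)
)
for p in paths:
    print(" - ".join(p))
```

The direct edge between the two is skipped because it is blocked, so only the two-hop paths come back. The graph DB version is one line of pattern; the hand-rolled version is where the maintenance cost shows up.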

What you're usually gaining with a graph database: a query language designed primarily for operating on graphs, and a design tuned to be very efficient at some subset of queries one might want to make against graphs. They're typically gonna be worse than other database types at other types of queries or data-fetching generally, and sometimes even for certain kinds of graph-focused queries, so watch your ass. Consider: if it were possible to tune on-disk and in-memory data structures for excellent performance with all query types, every database would do it. Graph databases make lots of trade-offs, typically, so make sure you mostly need to do the thing they're fast at.

If you don't have dense (many edges per node) and very large graphs and a need to do various things with them that could basically qualify as a form of path-finding, then you probably don't need/want a graph database.

If you need great performance at things graph databases are good at but also great performance at things PostgreSQL (or whatever) is good at, you can always run both and, say, use queries against your graph DB to inform what you fetch out of your SQL DB. This is less than ideal in a lot of ways (you now have a distributed system even if you didn't otherwise want/need one) but, especially if your graph DB data can be derived from your SQL DB so you don't have to worry too much about it getting screwed up, can make sense if you really need to do both things. I think this is how a lot of big players use them: recommendation engines and such querying against a graph DB, but e.g. invoices or inventory somewhere a little more general and robust.

[EDIT] to rebut a post downthread, just "my data model has lots of joins" is not a strong indication, per se, that you should use a graph DB. If your schema or some important and large (data-size wise) part of it consists of a couple tables and a lot of expensive recursive queries over those searching for things without much idea of how deep the recursion will go in advance, then you might want a graph DB.

[EDIT EDIT] Nb. graph DB companies may market to you that they are a suitable or even superior replacement for other types of database for most any purpose. Don't believe them. At all. They are trying to mislead you to make more sales (think: early MongoDB). Do your own research.

I work on arguably the most interesting of all machine learning tasks: semantic parsing (https://github.com/sebastianruder/NLP-progress/blob/master/e...)

Such a task would benefit from a graph database. I'm currently going the MongoDB route, but eventually a true graph DB would be a better fit.

Very interesting. I work with TerminusDB and we've been thinking a lot about how to apply a revision control semantic graph db to ML tasks. The whole MLOps process is fragmented and we think a collaborative revision control (like git but for data) that allows all of the parts to work together (data engineer, data scientist, ML engineer) could be very useful.

I had never heard of either TerminusDB or MLOps, so thanks for sharing!

A git for data like you describe intuitively seems (though it should be well defined) to be a very useful technology for many things, from safely versioning knowledge a la MediaWiki to versioning business data in DBs and making it seamless for the whole human pipeline (data engineer, data scientist, ML engineer).

Actually I have a startup idea that would require something similar yet different: I would need both version control for user data AND guaranteed immutability of what users have written. It would allow users to trust that the server cannot modify their data. For such a use case, the first thing that comes to mind is blockchains, but the technology feels too limiting. The only offering that I'm aware of as a general SQL DB is https://aws.amazon.com/qldb/

BTW, git but for data is an idea that has a lot of competing implementations; it would be nice for your landing page to explain what differentiates you from e.g. https://news.ycombinator.com/item?id=22731928

Anyway, I wish you good luck in this fun and probably useful project!

I'd say... anytime you are about to shove the square peg in the round hole...

More concretely, observe the complexity of nested sets [1] ... So if you need to represent trees, hierarchies or, well, graphs, in your data, maybe it could be reasonable to use a graph database instead of a relational one.

1: https://en.wikipedia.org/wiki/Nested_set_model
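To show where that complexity comes from, here is a small sketch of the nested set bookkeeping in Python. Each node gets a (left, right) pair from a depth-first walk; finding descendants is then a cheap range check, but adding one node forces renumbering of unrelated nodes. The tree and the helper names are invented for the example.

```python
def number_nested_sets(children, root):
    """Assign (lft, rgt) bounds to every node via depth-first traversal."""
    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        lft = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        bounds[node] = (lft, counter)
        counter += 1

    visit(root)
    return bounds


def descendants(bounds, node):
    """A node's descendants are exactly those whose bounds nest inside its own."""
    lft, rgt = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if lft < l and r < rgt)


tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = number_nested_sets(tree, "root")
print(bounds["a"], descendants(bounds, "a"))  # (2, 7) ['a1', 'a2']

# Adding a single leaf under "b" changes the stored bounds of "b" AND "root".
tree_after_insert = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
new_bounds = number_nested_sets(tree_after_insert, "root")
changed = sorted(n for n in bounds if bounds[n] != new_bounds[n])
print(changed)  # ['b', 'root']
```

In a real table each of those changed pairs is a row update, which is the square-peg feeling the comment is describing.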

It's useful to separate graph-shaped problems from graph-db-shaped problems. From what we (Graphistry) see when working with folks here:

1. Graph-shaped, and generally fine without a graph DB:

* You / your app wants to run some graph algorithms, it fits in CPU/GPU memory, you have the data elsewhere, and it's easily stitched into a graph. We regularly do 1000-1B nodes/edges on one GPU node. SQL/CSV/Parquet/Splunk/Spark query -> node+edge table -> ... . Ex: Correlating user journeys, mapping host/network IT/security log activity, analyzing bots, ... .

* You want to visually explore ^^^^ as graphs/relationships/correlations (where we often come in for Graphistry)

Having to manage 2 systems of record for some data to get some algorithmic/usability benefits is terrible, so often I recommend your regular DB + on-the-fly graph compute like ^^^^ .
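A bare-bones sketch of that "query -> edge table -> graph compute" flow, using only the Python standard library: rows that might come back from a SQL or log query are treated as edges, and a union-find pass groups everything that is linked together. The rows and the `connected_components` helper are invented, and this is not how Graphistry or RAPIDS do it; it only shows that no graph DB is needed for this kind of one-off computation.

```python
from collections import defaultdict


def connected_components(edge_rows):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for src, dst in edge_rows:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst  # merge the two components

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical rows, e.g. (user, host) and (host, ip) pairs pulled from logs.
rows = [
    ("user1", "hostA"),
    ("user2", "hostA"),
    ("user3", "hostB"),
    ("hostB", "ip9"),
]

print(connected_components(rows))
# [['hostA', 'user1', 'user2'], ['hostB', 'ip9', 'user3']]
```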

The upcoming security session of LearnRAPIDS.com will walk through some of this.

2. Graph search + graph enrichment, esp. on heterogeneous data or on > 1B nodes/edges.

2a. Graph query languages provide genericity not seen in normal SQL/NoSQL. Ex: An analyst or an ML algorithm wants to get a 360 on all data associated with some value, maybe a couple hops out. There may be many types of data available. In SQL/NoSQL land, you need to know all the ways to pivot ahead of time (Users.id -> Customers.user_id --phone--> Calls.phone), and pray that the Join queries don't tank the system either as one-off queries or in throughput scenarios.

2b. Graph DB impls can efficiently run certain search queries other DBs cannot. When your searches have extra fun patterns, like "between user A and user B, find all paths", and "Process A talks to Process B, which creates File C, which ...", this can be a big deal.

Growing in # of Tables or # Rows both make these more important.

3. Graph management, whole-graph analytics, write-heavy

* DB management can be good for auth & locked schema reasons even early on; part of why we did Neo4j early for ProjectDomino.org

* When working set sizes do start hitting say 100M or 1B, you may have a variety of queries where you don't want the overheads of going from scratch for everything (#1), esp. in a multi-user/service arch.

* Likewise, when data grows to multi-node & write-heavy, you may want it always on. An ephemeral system can be good (no state!), but if writes are needed too and you don't want 2 systems, a graph DB may be a good system.

We get involved in all 3 categories of graph projects and are happy to help.

If your queries require a lot of joins, then a graph database will make things easier.
