
Show HN: TerminusDB – An open source in-memory graph database - LukeEF
https://terminusdb.com/
======
LukeEF
TerminusDB is an open source (GPLv3), full-featured, in-memory graph database
management system with a rich query language: WOQL (the Web Object Query
Language).

TerminusDB originated at Trinity College Dublin in Ireland in 2015 when we
started working on the information architecture for ‘Seshat: the Global
Historical Databank’, an ambitious project to store lots of information about
every society in human history. We needed a solution that could enable
collaboration among a highly distributed team on a shared database whose
primary function was the curation of high-quality datasets with a very rich
structure, storing information about everything from religious practices to
geographic extent.

The historical databank was a very challenging data storage problem. While the
scale of the data was not particularly large, the complexity was very high.
First, a large number of different types of facts needed to be stored:
information about populations, carrying capacity, religious rituals, varieties
of livestock, etc. In addition, each fact had to be scoped with the period over
which it was likely to be true, which meant we needed ranges with uncertainty
bars on each of the endpoints. Then the
reasoning for the data point had to be apparent on inspection. If the value
for population was deduced from the carrying capacity, it is critical for
analysis to understand that this is not a "ground fact". But even ground facts
require provenance. We needed to store the information about which method of
measurement gave rise to the fact, and who had undertaken it. And if that
wasn't bad enough, we needed to be able to store disagreement - literally the
same fact might have two different values as argued by two different sources,
potentially using different methods.

On top of this, we needed to allow data entry by graduate students who might
or might not be reliable in their transcription of information, so additional
provenance information was required about who put each fact into the database.

Of course, all of this is possible in an RDBMS, but it would have been a
difficult modeling task to say the least. The richness of the data and the
extensive taxonomic information made a knowledge graph look more appropriate.
Given that Trinity College Dublin had some linked-data specialists, we opted to
try a linked-data approach to solving the problem.

Unfortunately, the linked-data and RDF tool-chains were severely lacking. We
evaluated several tools in an attempt to architect a solution, including
Virtuoso and Stardog, the most mature technology at the time, but found that
the tools were not really up to the task. While Stardog enforced the knowledge
graph or ontology as a constraint on the data, it did not provide us with
usable refutation witnesses. That is, when something was said to be wrong,
insufficient information was given to attempt automated resolution strategies.
In addition, the tools were set up to facilitate batch approaches to
processing data, rather than the live transactional approach which is the
standard for RDBMSs. Our first prototype was simply a Prolog program using the
Prolog database, with temporary predicates used within a transaction so that
running updates could have constraints tested without disrupting the reader's
view of the database. (We thought that over time the use of Prolog would be a
hindrance to adoption, as logic programming is not particularly popular,
though there are now several green shoots!)

Not long after this we were asked to attempt the integration of a large
commercial intelligence database into our system. The data, tens of gigabytes
of it, held information about all companies, boards of directors, and the
various relationships between companies and important people over the entire
course of the Polish economy since the 1990s. This, unsurprisingly, brought
our prototype database to its knees. The problem gave us new design
constraints. We needed a method of storing the graph that would allow us to
find chains quickly (one of the problems they needed to solve), would be small
enough that we didn't run out of main memory, and would also allow the
transaction processing to which we had grown accustomed. The natural solution
was to reuse the idea of a temporary transaction layer, but to keep the layers
around longer, and to make them very small and fast to query. We set about
looking for alternatives. We spent some time trying to write a graph storage
engine in Postgres. Ultimately, we found this solution to be too large and too
slow for multi-hop joins on the commercial intelligence use-case. Eventually
we found HDT (Header-Dictionary-Triples). On evaluation, it seemed to
represent a given layer in our transactions very well while also performing
much better, so we built a prototype utilizing the library.

Unfortunately, the HDT library exhibited a number of problems. First, it was
not really designed to allow the programmatic access required during
transactions, so we found it quite awkward. Second, we had a lot of problems
with re-entrancy leading to segfaults. This was a serious problem for the
Polish commercial use-case, which needed multi-threading to make search and
update feasible. Managing all of the layering information in Prolog was
obviously less than ideal; this should clearly be built into the solution at
the low level. We either had to fix and extend HDT, or build something else.

We opted to build something else, and we opted to build it in Rust. Our
specific use cases didn't require a lot of the code that was in HDT. We were
not as concerned with using it as an interchange format (one of the design
principles for HDT), and in addition to the layering system, we had plans for
other aspects to change fairly drastically as well. For instance, we needed to
be able to conduct range queries, and we wanted to segment information by
type. HDT was standardized for a very different purpose, and it was going to
be hard to make it fit ours. The choice of Rust was partly one born of the
pain of tracking down segfaults in HDT (written in C++). We were willing to
pay some upfront cost in development time not to search, oftentimes
fruitlessly, for segfaults.

At this point our transaction processing system had linear histories. The
possibility of tree histories, i.e. of having something with branching, was
now obvious and relatively simple to implement. It went onto the pile of things
which would be "nice to have" (a very large pile). It wasn't until we started
using the product in a production environment for a Machine Learning pipeline
that this "nice to have" became more and more obviously a "need to have".

As with any technical product of this nature, there were many different paths
we could have followed. After spinning out from the university and pursuing a
strategy of implementing large-scale enterprise graph systems with some
limited success, we decided to shift to open source and become TerminusDB
([https://github.com/terminusdb/terminus-server](https://github.com/terminusdb/terminus-server)).
Potentially a long road, but a happier path for us. We also decided to double
down on the elements of TerminusDB that we had found useful in project
implementation and felt were weakest in existing information architectures.
These are the ability to have very rich schemas combined with very
fine-grained, data-aware revision control, enabling the types of Continuous
Integration / Continuous Deployment (CI/CD) used extensively in software
engineering to be applied to data. We think that DataOps, basically DevOps for
data, or the ability to pipeline data and reduce its cycle time, is going to
become more and more important for data-intensive teams.

The name ‘TerminusDB’ comes from two sources. The Roman god of boundaries is
named Terminus, and good databases need boundaries. Terminus is also the home
planet of the Foundation in the Isaac Asimov series of novels. As our origin
is in building the technical architecture for the Global History Databank, the
parallels with Hari Seldon’s psychohistory are compelling.

To take advantage of this DataOps/’Git for data’ use case, we implement a
graph database with a strong schema so as to retain both simplicity and
generality of design. Second, we implement this graph using succinct immutable
data structures: prudent use of memory reduces cache contention, while
write-once, read-many data structures simplify parallel access significantly.
Third, we adopted the delta encoding approach to updates used in source
control systems such as Git. This provides transaction processing and updates
over immutable database data structures, recovering standard database
management features while also providing the whole suite of revision control
features: branch, merge, squash, rollback, blame, and time-travel,
facilitating CI/CD approaches on data. Some of these features are implemented
elsewhere (the immutable data structures of Fluree and Datomic, for example),
but it is the combination that makes TerminusDB a radical departure from
historical architectures.
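
To make the layering idea concrete, here is a minimal sketch in plain Prolog
(illustrative only; the actual storage engine implements layers in Rust over
succinct data structures). A database is a stack of immutable layers, each
recording what one transaction added and deleted, and a (ground) triple is
visible if the newest layer mentioning it adds rather than deletes it:

    :- use_module(library(lists)).

    % visible(+Triple, +Layers): Layers is a list of layer(Added, Deleted)
    % terms, newest first. Triple holds if the newest layer that mentions
    % it adds it rather than deletes it.
    visible(Triple, [layer(Added, _)|_]) :-
        memberchk(Triple, Added).
    visible(Triple, [layer(Added, Deleted)|Older]) :-
        \+ memberchk(Triple, Added),
        \+ memberchk(Triple, Deleted),
        visible(Triple, Older).

Time-travel then falls out for free: querying the database as of an earlier
transaction is just calling visible/2 on a suffix of the layer list, and a
branch is simply a second layer list sharing the older layers.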

In pursuing this path, we see that innovations in CI/CD approaches to software
design have left RDBMSs firmly planted in the past. Flexible distributed
revision control systems have revolutionized the software development process.
The maintenance of histories, of records of modification, of distribution
(push/pull) and the ability to roll back, branch or merge enables engineers to
have confidence in making modifications collaboratively. TerminusDB builds
these operations into the core of our design, with every choice about
transaction processing made to ease their efficient implementation.

Current databases are shared knowledge resources, but they can also be areas
of shared destruction. The ability to modify the database for tests, for
upgrades, etc. is hindered by having only the single current view. Any change
is forced on all. The lack of fear of change is perhaps the greatest
innovation that revision control has given us in code. Now it is time to have
it for data. Want to reorganize the structure of the database without breaking
all of the applications which are using it? Branch first, make sure it works,
then you can rebase master with confidence. While we are satisfied with progress,
databases are hard and typically take a number of years to harden. It is a
difficult road as you need to have something which works while keeping your
options for progress open. How do you eat an elephant? One bite at a time.

We hope that others will see the value in the project and contribute to the
code base. We are committed to delivering all features as open source and will
not have an enterprise version. Our monetization strategy is to roll out
decentralized data sharing and a central hub (if TerminusDB is Git for data,
TerminusHub will be the ‘GitHub for data’) in the near future. Any cash that
we generate by charging for seats will subsidize feature development for the
core open source database. We are not sure if this is the best strategy, but
we are going to see how far it takes us.

~~~
emmanueloga_
> the linked-data and RDF tool-chains were severely lacking.

I'm curious if you evaluated any of these solutions and if yes why you found
them lacking:

* Jena

* RDF4j

* Blazegraph

Also, why not SPARQL? And if your query language is based on triples, why not
Turtle for serialization instead of JSON-LD? JSON-LD is not very
human-friendly to write by hand, imho.

~~~
ggleason
We evaluated Jena (we actually prototyped the first system in Jena) and RDF4j.
At the time we didn't look into Blazegraph. As we began the project, we
developed hypotheses about the kinds of features we would need: the ability to
check the instance data against a very complex and evolving schema; the
ability to do schema and instance updates in a single transaction; and the
ability to receive a program-interpretable "witness of failure" (a refutation
proof) so that automatic strategies could be undertaken to correct failures in
certain cases.

As we went forward we also acquired the collaboration and time-travel
requirements to facilitate large team data curation - which aren't present in
these offerings.

As for SPARQL, it's essentially a messy and truncated datalog. It just seemed
a bit silly not to use an actual datalog! Then, as I've said elsewhere, we
needed to be able to do fast time-window queries, which CLP(fd) is excellent
at and which SPARQL doesn't support.
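
For a flavour of what that looks like, here is a minimal sketch in plain
SWI-Prolog with library(clpfd), using made-up data rather than WOQL syntax:
each fact's start and end dates are only known to lie within a range, and we
ask which facts could overlap a query window.

    :- use_module(library(clpfd)).

    % reign(Ruler, EarliestStart, LatestStart, EarliestEnd, LatestEnd)
    reign(alice, 1200, 1210, 1230, 1240).
    reign(bob,   1235, 1245, 1260, 1270).

    % Ruler may have been reigning somewhere in WStart..WEnd if some
    % consistent choice of dates overlaps the window.
    possibly_reigning(Ruler, WStart, WEnd) :-
        reign(Ruler, ES, LS, EE, LE),
        Start in ES..LS,
        End in EE..LE,
        Start #=< End,
        Start #=< WEnd,
        End #>= WStart,
        label([Start, End]).

Asking possibly_reigning(R, 1240, 1242) succeeds for both rulers, because each
has some admissible dating that overlaps the window; the solver prunes
impossible date combinations before labelling.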

JSON-LD is most definitely not human-friendly, and Turtle is certainly better.
In our interface you can manipulate the schema in Turtle, as it's far clearer.

The advantage of JSON-LD is as an interchange format. With Turtle you are
pasting together strings. With JSON-LD, libraries for querying the database
are just functions which build JSON-LD. It also simplifies parsing on the
opposite end.

But in addition JSON-LD + strong schema give you a way to treat segments of
the graph as documents! We can use our get_document word and you'll get back
the JSON-LD fragment of the graph associated with an ID. JSON-LD was carefully
designed so that this bidirectional interpretation is possible.

------
tlowrimore
This looks really great. It's interesting to see that 99% of the source code
is Prolog. I'm curious to know what advantage--real or perceived--did Prolog
provide over some other, more mainstream language?

Also, @ggleason, based upon your experience with other languages, would you
still use Prolog if you had to do the whole thing over again?

~~~
ggleason
Prolog implementations are very efficient at implementing back-tracking, so if
you end up using a lot of back-tracking it definitely makes sense. My first
prototype was started in Java and it was a nightmare. Secondly, for writing
the query compiler, Prolog was just such an elegant language.

SWIPL has a large enough and nice enough library that it makes it feel similar
to other dynamic languages (Python, etc.) in terms of implementing
run-of-the-mill glue code.

I'm very fond of Prolog as the implementation language for the constraint
checking, and especially CLP(fd). I think CLP(fd) is such a killer feature
that once people start using it in their queries, they're going to wonder how
they got on before.

I would like Prolog to be a lot more feature-mature for the current age,
however. It needs a bigger community to help flesh the language out! So many
things could be made better: better mode analysis, better type checking and
simply more libraries.

~~~
tlowrimore
Thank you so much for your response. You may have convinced me to have another
look at Prolog. I stumbled upon it 15 years ago, but never used it for any
real project. I just remember really loving its declarative style.

------
TheTank
Interesting project. Out of curiosity, why did you compare against kdb+, since
the models are very different? KDB is mostly used for in-memory time-series
while yours seems to be a graph-oriented DB. Also, why did you choose to build
your own language instead of using an existing one [1]?

[1][https://en.wikipedia.org/wiki/GQL_Graph_Query_Language](https://en.wikipedia.org/wiki/GQL_Graph_Query_Language)

~~~
ggleason
The reason is that we were positioning ourselves to deal with customers who
had financial data stored as time-series. We aren't hoping to compete with
kdb+ on speed (which would be hopeless), but we have a prototype of Constraint
Logic Programming [CLP(fd)] based approaches to doing time queries, which is
very expressive and which we hope to roll out in the main product on Hub in
the near future.

Graph databases are still in their infancy and there are a lot of graph query
languages about. We played around with using some that already exist
(especially SPARQL) but decided that we wanted a number of features that were
very non-standard (such as CLP(fd)).

Using JSON-LD as the definition, storage and interchange format for the query
language has advantages. Since we can marshal JSON-LD into and out of the
graph, it is easy to store queries in the graph. It is also very simple to
write query libraries for a range of languages by just building up a JSON-LD
object and sending it off.

We are firmly of the belief that datalog-style query languages which favour
composability will eventually win the query language wars, even if it is not
our particular variety which does so. Composability was not treated as
centrally as it should have been in most of the graph languages.

~~~
b3tt3rw0rs3
Composeability is front and center in GQL, so you may want to consider it.

~~~
LukeEF
We will 100% consider it and have been engaging with the community about the
best approach. Cypher is by far the biggest graph query language and they seem
to have the most weight in the conversation so far, but we are going to try to
represent datalog as far as possible. Even if woql isn't the end result we
think datalog it is the best basis for graph query so we'll keep banging the
drum (especially as most people realize that composability is so important)

------
laurencerowe
1. Do you support JSON-LD Framing?

I have a set of JSON-LD documents which I would like to query across and for
each result return a nested JSON object for display in a user interface. For
example, let's imagine I was querying an employee database to find employees
with the job title "Software Developer" and for each matching Employee return
the following nested structure:

    
    
      {
        "@type": "Employee",
        "@id": "employee:bob",
        "name": "Bob",
        "job_title": "Software Developer",
        "email": "bob@example.com",
        "manager": {
          "@type": "Employee",
          "@id": "employee:alice",
          "name": "Alice"
        }
      }
    

I can see how I could write the filter based on the pattern-matching syntax,
but I don't see how I could gather the data to produce the result (which in
reality might be much more deeply nested).

2. Have you benchmarked against RedisGraph?

They seem to have achieved very good performance from building on top of
GraphBLAS.

~~~
ggleason
1. We do support framing. Our documents are defined using a special
superclass called "terminus:Document". Anything in the downward-closed DAG up
to another "terminus:Document" is considered part of that document. You can
ask for some number of unfoldings of the internal documents by passing a
natural number, in which case you will frame the underlying documents. We
might extend this framing to allow more sophisticated approaches later if
there is interest.
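
As a minimal sketch of that unfolding semantics in plain Prolog (made-up data,
ignoring non-document intermediate nodes for brevity; the real implementation
sits on our storage layer):

    % The graph, and which nodes are documents.
    triple(bob,   name,    "Bob").
    triple(bob,   manager, alice).
    triple(alice, name,    "Alice").
    is_document(bob).
    is_document(alice).

    % doc_triple(+Id, +Depth, -Triple): a triple belongs to Id's frame,
    % descending into referenced documents at most Depth times.
    doc_triple(Id, _, rdf(Id, P, O)) :-
        triple(Id, P, O).
    doc_triple(Id, Depth, T) :-
        Depth > 0,
        triple(Id, _, Sub),
        is_document(Sub),
        D is Depth - 1,
        doc_triple(Sub, D, T).

At depth 0 you get only bob's own triples (manager stays a bare reference);
at depth 1 alice's name is pulled in as well, matching the nested employee
example above.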

2. We have not performed benchmarks against RedisGraph. We intend to do
benchmarks in the future but are currently focusing on collaboration features
rather than raw speed.

I had just come up with some methods of using GPUs to speed up graph search
when I saw the RedisGraph whitepaper and that they had already done it. I have
to admit I was more than a little jealous! It's a good idea.

We'll look at the approach again in the future - our next steps are exposing
CLP(fd) in our query language.

~~~
laurencerowe
So if I understand correctly, unfolding one level would embed the referenced
documents in the result one level down, two would also embed the documents
referenced in the root document's referenced documents, etc.

I think it's definitely worth supporting differing depths of reference
embedding along different paths. Ideally, though, you want to select which
properties are included, as when you get to several levels of embedding the
resulting document is very large and often you only need a subset of
properties expanded for the embedded documents.

Additionally it can be very helpful to embed along reverse references
(parent.children from the reference stored as child.parent).

Concrete example, take this page on a scientific data portal:
[https://www.encodeproject.org/experiments/ENCSR807BGP/](https://www.encodeproject.org/experiments/ENCSR807BGP/)

The JSON-LD document representing that object itself is:
[https://www.encodeproject.org/experiments/ENCSR807BGP/?forma...](https://www.encodeproject.org/experiments/ENCSR807BGP/?format=json&frame=object)
and many reference paths are expanded to generate the JSON required to
construct the UI:
[https://www.encodeproject.org/experiments/ENCSR807BGP/?forma...](https://www.encodeproject.org/experiments/ENCSR807BGP/?format=json)
(this is larger than it needs to be as the entire referenced document is
embedded rather than just the necessary properties.)

While pre-generating the deeply embedded document means fetching it is fast,
as the site has grown keeping them up to date has become challenging, so I've
been looking at options for embedding dynamically.

~~~
ggleason
"So if I understand correctly, unfolding one level would embed the referenced
documents in the result one level down, two would also embed the documents
referenced in the root document's referenced documents, etc."

That's correct.

"I think it's definitely worth supporting differing depths of reference
embedding along different paths."

Yeah, path unfoldings were something we were thinking about, but they ended up
on our very tall stack of things we want to do. You can generally get around
it by calling the get_document API on the client end from JavaScript, although
it is true this will be much slower.

------
bearjaws
"AI Code Generation"

I see this mentioned in the product comparison chart, but no mention of what
that actually means.

~~~
ggleason
Frankly I think this should be removed. Marketing sometimes gets overzealous
in trying to present what makes us special.

~~~
LukeEF
Yes, we need to delete that. And actually think about all the product
comparison categories again.

Too many parts!

~~~
winrid
Yeah, two of the categories for MongoDB are wrong, for example.

~~~
LukeEF
Which two? On ACID, I hear the arguments about Mongo Atlas since they
integrated the WiredTiger tech (to basically become an RDBMS), but it's still
not ACID. The other is cloud native, I suppose? Again, things have moved along
and we gave them a balanced score.

~~~
matchy
MongoDB has full ACID support in the database.

~~~
LukeEF
Mongo does not provide durability by default.

------
ThePhysicist
Maybe a stupid question, but does it persist data on disk as well or is it a
pure in-memory approach e.g. for analytical workloads?

~~~
ggleason
Not a stupid question! The database is in memory but we journal all
transactions to disk, so it is persistent. In fact it's so persistent that it
never goes away. We have an append-only storage approach allowing you to do
time-travel. You can query past versions, or look at differences, or even
branch from a previous version of the database.

~~~
rochak
I read about append-only storage when I was studying log-structured file
systems. Is it somehow similar?

~~~
chekovcodes
Yes, although with databases, as opposed to log files, you also need to be
able to somehow append 'deletes' and updates to existing records as well as
just adding new records. The advantage of doing so is that it greatly
simplifies transactional processing and makes it much more parallelisable,
because you never change any existing data, just add new records on top.
Blockchains have similar 'immutable' characteristics. The other reason why
append-only storage is desirable is that it allows you to time travel simply
by backtracking through the append logs, and you can then do things like
reconstructing any state by replaying all of the append events that got you
there.

There are a variety of databases and database management systems that try to
do this; most of them run into problems with the meta-language needed to
describe updates and deletes to, for example, SQL tables or something similar.
This is a hideously tricky and detailed problem, because there are all sorts
of ways in which an SQL table can be changed, many of which have implications
for all sorts of other bits of data, and you have to capture all of this in
your 'update' append log.

On the other hand, if, like TerminusDB, you use RDF triples as the underlying
language, then the problem is pretty trivial: every update can always be
expressed as a set of deleted triples and a set of added triples.
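
As a minimal sketch in plain Prolog (hypothetical triples, not TerminusDB's
actual storage code): an update is just a pair of triple sets, and applying
it never mutates the original state.

    :- use_module(library(lists)).

    % apply_update(+Triples0, +Update, -Triples): Update is an
    % update(Deletes, Adds) pair. The old triple set is untouched;
    % we only build a new one.
    apply_update(Triples0, update(Deletes, Adds), Triples) :-
        subtract(Triples0, Deletes, Kept),
        append(Kept, Adds, Triples).

    ?- apply_update([rdf(bob, job_title, developer)],
                    update([rdf(bob, job_title, developer)],
                           [rdf(bob, job_title, manager)]),
                    New).
    New = [rdf(bob, job_title, manager)].

Keeping the update pairs around, rather than discarding them once applied, is
what gives you a history to travel back through.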

~~~
rochak
But wouldn't an append-only database demand substantially more storage, which
would also grow at a faster rate? How is that handled, given we won't be able
to fit the database on a single disk?

Also, I wanted to understand how databases work and wanted to build one from
scratch. Is there any website/tutorial/blog you think could help me out?
Thanks a lot!

------
weeksie
Congrats Gavin and crew! Know y'all have been working on this for a while,
it's awesome to see it coming along.

------
mark_l_watson
Looks cool, and being written in SWI-Prolog is a nice bonus.

This is not a criticism, but I am curious why a new query language, WOQL, was
designed and implemented instead of just using SPARQL. It seems like it would
not be too difficult to write a converter between WOQL and SPARQL.

Also, SWI-Prolog has excellent semantic web libraries with SPARQL support.

~~~
ggleason
We debated using SPARQL initially, and even had a test implementation
leveraging the version shipped with SWI-Prolog. However, we found SPARQL to
have a number of shortcomings that we wanted to see addressed.

Firstly, there were features. Tighter integration with OWL and
schema-awareness was something we wanted to build in at a low level. We also
wanted to leverage CLP(fd) for time and other range queries. If we were going
to need to fundamentally alter the semantics anyhow, it didn't seem that
important to start with SPARQL. Other features coming down the pipe very soon
are access to recursive queries and manipulation of paths through the graph.

Secondly, we wanted better composability. One of the real strengths of SQL as
a language is that it is very easy to compose language elements. By contrast,
SPARQL feels quite ad hoc, mimicking some of the style of SQL but losing this
feature.
Lastly, we wanted to have a language which would be easy to compose from
JavaScript or Python: JavaScript because we live in a web age, and Python
because it's the choice of many data scientists. JSON-LD provides a nice
intermediate language in which to write queries in either of those languages.
No need to paste together strings! And then, because of the choice of JSON-LD,
we can naturally store our queries in the graph (and even query our queries).

~~~
mark_l_watson
Good answer, thanks for that! I am working on a commercial product to help
people form SPARQL queries, and I must admit pro-SPARQL bias. Your decision
makes a lot of sense.

