
A relatively simple Datalog engine - dmit
https://github.com/frankmcsherry/blog/blob/master/posts/2018-05-19.md
======
stickykbd
For those who may have missed it, Mozilla is working on a Datalog knowledge
store backed by SQLite called Project Mentat [1] (also written in Rust). An
announcement and rationale written by one of its devs can be found here [2],
and a previous HN discussion is here [3].

[1] [https://github.com/mozilla/mentat](https://github.com/mozilla/mentat)

[2] [https://medium.com/project-tofino/introducing-datomish-a-fle...](https://medium.com/project-tofino/introducing-datomish-a-flexible-embedded-knowledge-store-1d7976bff344)

[3]
[https://news.ycombinator.com/item?id=13568559](https://news.ycombinator.com/item?id=13568559)

~~~
trurl
I don't see any mention of recursion. If it doesn't support recursion, it is
not really accurate to call it Datalog.

To clarify, if you don't have recursion, you've just got a slightly different
skin over relational algebra.
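
To make the recursion point concrete, here is a minimal sketch (in Python, with made-up edge facts) of the classic recursive program that select/project/join alone cannot express: transitive closure.

```python
# Naive bottom-up evaluation of the recursive Datalog program:
#   path(X, Y) :- edge(X, Y).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
edge = {(1, 2), (2, 3), (3, 4)}

path = set(edge)  # base rule
while True:
    # recursive rule: join path with edge on the shared variable Y
    new = {(x, z) for (x, y) in path for (y2, z) in edge if y == y2}
    if new <= path:  # nothing new derivable: fixed point reached
        break
    path |= new

# path now also contains the derived facts (1, 3), (2, 4) and (1, 4)
```

The `while` loop is the fixpoint that relational algebra lacks; without it you could only unroll the rule to some fixed depth.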

------
steveklabnik
For some interesting background here: As Rust moves from lexical to
non-lexical lifetimes, there have been a few different ways of actually
implementing it. The initial implementation had some bugs and was fairly slow,
but Niko had some other ideas:
[http://smallcultfollowing.com/babysteps/blog/2018/04/27/an-a...](http://smallcultfollowing.com/babysteps/blog/2018/04/27/an-alias-based-formulation-of-the-borrow-checker/)

The initial implementation used differential dataflow, but this was just
merged in:
[https://github.com/rust-lang-nursery/polonius/pull/36#issuec...](https://github.com/rust-lang-nursery/polonius/pull/36#issuecomment-390393717)

Very cool stuff!

~~~
DannyBee
Yes, this is a very traditional alias formulation of these sets of problems,
and there has been a lot of work in the past 10 years on expressing and
computing aliasing-like systems in Datalog.

I don't know about Rust, but the cost of that abstraction elsewhere is pretty
good. Datalog will handily beat basic implementations, and those that most
researchers have time to implement. However, very well engineered,
purpose-built solvers still beat Datalog engines by a large factor (again, no
theoretical reason this should be true, but it is), so production compilers
still currently eat the complexity of building alias solvers. It would be nice
if one day that was not true.

~~~
tom_mellior
> However, very well engineered purpose built solvers still beat datalog
> engines by a large factor (again, no theoretical reason this should be true,
> but it is)

Sure there is a reason: Many applications, alias analysis in particular, only
use a subset of Datalog. Subsetting languages very often makes for easier and
more efficient implementation. If you know that there is no negation in your
alias problems, your special purpose solver can be much more efficient than a
Datalog solver that must be prepared to deal with negation.

Of course you might wish for a generic Datalog solver that you could run in a
special mode where you promise that it will not encounter negation in the
problem you give it. But that is then no longer a Datalog solver but a "well
engineered purpose built solver".

As for the "built in" part, anything that is not built in must cross an API
boundary and usually translate between different data representations. That is
a theoretical reason for being slower.

~~~
DannyBee
FWIW: None of this is a theoretical reason. This is no different than the
problems any other compiler/JIT faces.

What you've offered are practical reasons it hasn't happened yet, but not
theoretical reasons.

~~~
tom_mellior
"System A _must_ do more work than system B" seems a pretty theoretical reason
why, in the limit (after sinking an arbitrary number of engineer hours in
each), system A must be slower than system B. But your interpretations of the
word "theoretical" may differ.

~~~
frankmcsherry
I think the point is that there is no clear reason System A _must_ do more
work than System B. Serious Datalog environments _will_ identify strictly
monotonic programs and execute them more efficiently (e.g. with racy
asynchrony); they _will_ identify linear recursion and partition work
accordingly. They could identify programs corresponding to DannyBee's
hypothetical purpose-built industrial winners and execute them using the exact
same techniques, but apparently they don't.

I suspect the point of disagreement is

> Of course you might wish for a generic Datalog solver that you could run in
> a special mode where you promise that it will not encounter negation in the
> problem you give it. But that is then no longer a Datalog solver but a "well
> engineered purpose built solver".

which I don't think is correct. A Datalog solver with the property that it can
run faster on certain classes of input is great; it is not "no longer a
Datalog solver".

~~~
tom_mellior
> Serious Datalog environments will identify strictly monotonic programs [...]

Yes. And that is extra work that _must_ be done. Admittedly, it probably isn't
the "large factor" mentioned by DannyBee.

> A Datalog solver with the property that it can run faster on certain classes
> of input is great; it is not "no longer a Datalog solver".

I was talking of a special, user-requested mode in which the user guarantees
that the input program they are interested in will not contain certain
constructs, and the solver can reject programs that do contain them. A
"Datalog" solver that rejects programs containing, say, negation, does not
handle the full Datalog language and is therefore no longer a Datalog solver.

As an analogy, Clang is a C++ compiler, but passing it the "-xc" flag causes
it to reject valid C++ programs, so then it's no longer a C++ compiler
(although it is still the same program!).

------
triska
Nice! It's always good to see Datalog in action. In my experience, Datalog is
often more convenient and more readable than SQL.

As the post mentions, the current ergonomics (" _a wall of definitions_ ") are
indeed the main shortcoming of the present approach, which is syntactically
far removed from Datalog.

My suggestion towards an automated conversion is to consider Prolog:
Syntactically, a Datalog clause is a valid Prolog term. In fact, Datalog is a
syntactic subset of Prolog. Therefore, it can be easily parsed with Prolog
(using the predicate read/1), and then inspected via built-in mechanisms. You
may be able to write a Prolog program that converts proper Datalog definitions
to such Rust snippets.
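
To gesture at how mechanical that parsing step is (sketched here in Python with a deliberately naive grammar, rather than via Prolog's read/1; real Datalog syntax has more cases than this handles), a clause splits into one head atom and a list of body atoms:

```python
import re

# Split a Datalog clause like "path(X, Z) :- path(X, Y), edge(Y, Z)."
# into (head, body), where each atom is (predicate, [arguments]).
# Naive on purpose: no strings, no nested terms, no comments.
ATOM = re.compile(r"(\w+)\(([^)]*)\)")

def parse_clause(text):
    head_txt, _, body_txt = text.strip().rstrip(".").partition(":-")
    atoms = lambda s: [(pred, [a.strip() for a in args.split(",")])
                       for pred, args in ATOM.findall(s)]
    return atoms(head_txt)[0], atoms(body_txt)

head, body = parse_clause("path(X, Z) :- path(X, Y), edge(Y, Z).")
# head is ("path", ["X", "Z"]); body lists the two body atoms in order
```

From a structure like this, emitting definitions for a target engine is a straightforward tree walk.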

~~~
a-saleh
I think it would be better if we figured out how to do embedded Datalog in
other languages. E.g., I really liked the research here
[https://github.com/rntz/datafun](https://github.com/rntz/datafun), where we
start with "datalog = relational algebra + fixpoints" and then try to figure
out how far we can take this in a language where not everything is a table of
facts :)

~~~
convolvatron
schelog by Dorai Sitaram
[https://ds26gte.github.io/schelog/index.html](https://ds26gte.github.io/schelog/index.html)
is a classic version of a functional/declarative hybrid as a purely operational
artifact. apparently it's still being updated, but I remember it from many
years ago (early 90s?).

it uses the ability to return multiple times from a continuation to allow sets
of results (streams or solution trees) to be expressed directly in scheme and
intermixed freely with the functional subset.

a 'nice' mixture of declarative and procedural would be really slick. i should
go back and look at mozart/oz again.

I'll definitely see if my limited ability to deal with formalisms lets me
understand the type system proposed in Datafun.

~~~
a-saleh
"Streams of solution trees" kind of reminds me of µKanren [1]. When I was
using Clojure roughly 3 years ago, Kanren was quite a popular embedded logic
language.

W.r.t. Datafun, I learned about it from a talk its creator gave at Strange
Loop last year [2]; to me it seemed quite approachable :)

[1]
[http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf](http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf)
[2]
[https://www.youtube.com/watch?v=gC295d3V9gE](https://www.youtube.com/watch?v=gC295d3V9gE)

------
xvilka
Since Datalog is just a subset of Prolog, I want to link a very good book
about modern Prolog and its applications [1].

[1] [https://www.metalevel.at/prolog](https://www.metalevel.at/prolog)

~~~
trurl
While it is possible to view Datalog as a subset of Prolog, they are really
very different languages. In particular, Datalog implementations generally use
forward rather than backward chaining evaluation. As such, programs you might
write in Datalog might diverge in a naïve Prolog implementation. Depending on
what variant and extensions of Datalog you are using, a program that you could
write in the subset of Prolog that matches Datalog might similarly diverge in
a Datalog engine.

So linking to a text on Prolog is not really relevant to a discussion of the
merits of Datalog.

~~~
triska
Prolog and Datalog also have a lot in common: In addition to Datalog being a
syntactic subset of Prolog, the semantics of Prolog and Datalog, and hence the
solutions you obtain, are also _identical_ if you enable SLG resolution in
your Prolog system and run a Datalog program with it.

SLG resolution (also called "tabling" in Prolog) prevents both positive and
negative loops, and always terminates for programs that have the bounded term-
size property, as is the case for all Datalog programs. XSB Prolog is a
notable example of a Prolog system that supports this mode, which you can
enable with the table/1 directive:

[http://xsb.sourceforge.net/shadow_site/manual1/node2.html](http://xsb.sourceforge.net/shadow_site/manual1/node2.html)

Therefore, you can also use a Prolog system such as XSB Prolog to learn
Datalog, and then seamlessly move to Prolog. Conversely, if you know Prolog,
then learning Datalog is very easy. Another very nice system for learning
Datalog is DES, which you can run in SICStus Prolog and other systems:

[http://des.sourceforge.net/](http://des.sourceforge.net/)

Various extensions such as _aggregate functions_ and object-oriented
programming (which are mentioned in the Wikipedia entry for Datalog) are also
applicable to both Prolog and Datalog with identical or closely related
semantics.
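
A toy sketch of the idea behind tabling (Python, not real SLG resolution; the relations are invented): derived answers go into a table that grows to a fixed point, so a cyclic graph that would send a plain depth-first search into a loop still terminates.

```python
edge = {("a", "b"), ("b", "a"), ("b", "c")}  # note the a <-> b cycle

def path_answers():
    # Tabled evaluation of:
    #   path(X, Y) :- edge(X, Y).
    #   path(X, Z) :- edge(X, Y), path(Y, Z).
    # Plain SLD (Prolog-style) search would loop on the cycle; recording
    # answers and iterating to a fixed point terminates because only
    # finitely many facts can ever be derived (bounded term size).
    answers = set()
    changed = True
    while changed:
        changed = False
        for (x, y) in edge:
            derived = {(x, y)} | {(x, z) for (y2, z) in answers if y2 == y}
            for fact in derived - answers:
                answers.add(fact)
                changed = True
    return answers

facts = path_answers()  # six facts, including ("a", "a") and ("b", "b")
```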

~~~
trurl
Yes, I'm well aware of tabling, which is why I said "naïve Prolog
implementation". I'm also aware of using magic sets and demand transformations
to allow writing Prolog-like relations in Datalog.

I also work on an industrial-strength Datalog implementation, and I can tell
you unequivocally there is very little relationship between how a real Datalog
engine is implemented and how a Prolog implementation works.

And no, learning Prolog will not help you learn Datalog. Unless you're going
to always use tabling or demand transformations, the difference in evaluation
model requires writing in a very different style.

~~~
triska
Yes, such implementation differences definitely exist!

Whether one emphasizes the differences or commonalities of Prolog and Datalog
may also depend on commercial interests: For vendors of Datalog systems, it
may be advantageous to stress the differences. For Prolog vendors, it may
conversely be advantageous to stress their commonalities.

Personally, I learned Prolog before Datalog, and it was easy for me to learn
Datalog because I was already familiar with Prolog. The reason may be that I
try to write as declaratively as possible when programming in Prolog, and that
carried over well to Datalog too. More recently, I have also used tabling in
many Prolog programs, and often obtained very general programs with good
termination properties, similar to Datalog definitions.

------
wernsey
Awesome. I look forward to browsing its code.

I implemented a Datalog engine in Java a while back. Looking back on it now,
there are a bunch of things I might have done differently, but I had never
tried to implement anything like it before.

I remember that doing the stratified negation was quite complex, and I won't
be able to explain it off the top of my head anymore.

Anyway, here's my version:
[https://github.com/wernsey/Jatalog](https://github.com/wernsey/Jatalog)

~~~
ScottBurson
I may actually have a use for this. Thanks for releasing it!

BTW the use of functional maps, such as those in [0], might have advantages
over things like StackMap. Of course it would require some restructuring of
the code.

[0] [https://github.com/slburson/fset-java](https://github.com/slburson/fset-java)

~~~
wernsey
Thank you. I'll take a look.

In retrospect I think there might be several ways to improve performance by
using other data structures to index the facts, but I haven't had the time to
work on it lately. There is a pull request from someone that I'm still looking
at, which might just spur me on to think about it some more.

------
justincormack
If you want a Go implementation, look at OPA
[https://github.com/open-policy-agent/opa](https://github.com/open-policy-agent/opa)
- designed for authorization but quite general.

------
myWindoonn
What would be some advantages of this design over the µKanren design [0] which
uses a search monad [1] and expresses unification and relations directly as
plain values and functions?

[0]
[http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf](http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf)

[1] [http://homes.soic.indiana.edu/ccshan/logicprog/LogicT-icfp20...](http://homes.soic.indiana.edu/ccshan/logicprog/LogicT-icfp2005.pdf)

~~~
tom_mellior
Those papers and systems deal with Prolog-like languages, not Datalog-like
ones. These are different languages, even though there is a common set of
similar-looking programs with similar-ish declarative semantics but completely
different procedural semantics. In particular, it's easy to write programs
that easily compute the desired results when run on a Datalog system, but that
run into an infinite loop when run on a Prolog system.

Also, the solver here is special-cased to treat only a certain subset of
Datalog that is much simpler than the full language.

~~~
convolvatron
it looks like it doesn't handle negation. one of those unfortunate but
interesting features of the space of recursive declarative languages.

if you allow something to be true only if something else is not true, your
solution space is no longer monotonic. meaning that a result you thought was
part of the solution can later be retracted. if there is a cycle (trivially "a
if not a"), then you may not ever arrive at an answer. that might seem simple
to detect, but that cycle can exist through an arbitrary number of rules and
data dependencies, so in the limit it is undecidable.

ideally, as in this solver, you could keep adding terms you could show to be
correct from the data and rules at hand until they and all their consequences
have been found, at which point you can stop (a fixed point). but negation
screws up this plan.

there are various restrictions on negation, solver heuristics, and partial
conservative proofs of termination that can be applied. but this is where your
beautiful simple declarative world starts getting sad and messy.

[edit: I'd like to leave this because I wish I understood the problem with
negation earlier, but I just skimmed and missed stratification, Frank
thoughtfully explains below]
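
For what it's worth, a minimal sketch (Python, with invented relations) of how stratification rescues the fixpoint plan: negation is only ever applied to a relation whose stratum has already been computed to completion, so each stratum by itself stays monotone.

```python
nodes = {"a", "b", "c", "d"}
edge = {("a", "b"), ("b", "c")}

# Stratum 1 (monotone fixpoint): reachable(X), starting from "a".
reachable = {"a"}
while True:
    new = {z for y in reachable for (y2, z) in edge if y2 == y}
    if new <= reachable:
        break
    reachable |= new

# Stratum 2: negation, but only against the finished stratum-1 relation:
#   unreachable(X) :- node(X), not reachable(X).
unreachable = {x for x in nodes if x not in reachable}  # {"d"}
```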

~~~
frankmcsherry
I'm drinking at the moment, so maybe messy but not sad.

The engine supports negation in the standard way: stratification. You can call
'from_antijoin' with a static relation for its second argument but not a
variable. It is not discussed in the post because it is a bit of a diversion
from the intended message (and because antijoin is computationally not much
different from a filter).

If you would like a more general compute model with non-monotonic iteration, I
recommend differential dataflow!
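
For readers unfamiliar with the operation, a sketch of why antijoin against a static relation amounts to a filter (this shows the relational operation in general, not the engine's actual API; the data is invented):

```python
# Antijoin: keep each tuple whose key does NOT appear in the second
# relation. With a static second relation this is just a per-tuple
# membership test, i.e. a filter over the first relation.
facts = [("a", 1), ("b", 2), ("c", 3)]
excluded = {"b"}  # static relation of keys to subtract

kept = [(k, v) for (k, v) in facts if k not in excluded]
# kept == [("a", 1), ("c", 3)]
```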

