A relatively simple Datalog engine (github.com)
207 points by dmit 11 months ago | 28 comments

For those who may have missed it, Mozilla is working on a Datalog knowledge store backed by SQLite called Project Mentat [1] (also written in Rust). An announcement and rationale written by one of its devs can be found here [2], and a previous HN discussion is here [3].

[1] https://github.com/mozilla/mentat

[2] https://medium.com/project-tofino/introducing-datomish-a-fle...

[3] https://news.ycombinator.com/item?id=13568559

I don't see any mention of recursion. If it doesn't support recursion, it is not really accurate to call it Datalog.

To clarify, if you don't have recursion, you've just got a slightly different skin over relational algebra.
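To make the distinction concrete, here is a minimal sketch (my own illustration, not code from the linked engine) of the classic recursive Datalog program that relational algebra alone cannot express: transitive closure of an edge relation, computed as a bottom-up fixpoint.

```python
# Rules being evaluated:
#   path(X, Z) :- edge(X, Z).
#   path(X, Z) :- path(X, Y), edge(Y, Z).

def transitive_closure(edges):
    """Naive bottom-up fixpoint: apply the rules until no new facts appear."""
    path = set(edges)  # path(X, Z) :- edge(X, Z).
    while True:
        # path(X, Z) :- path(X, Y), edge(Y, Z).
        new = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if new <= path:  # fixpoint reached: no new facts derivable
            return path
        path |= new
```

The recursion is exactly the re-application of the second rule to its own output; a single relational-algebra expression has no way to iterate "until nothing changes".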

For some interesting background here: As Rust moves from lexical to non-lexical lifetimes, there have been a few different ways of actually implementing it. The initial implementation had some bugs and was fairly slow, but Niko had some other ideas: http://smallcultfollowing.com/babysteps/blog/2018/04/27/an-a...

The initial implementation used differential dataflow, but this was just merged in: https://github.com/rust-lang-nursery/polonius/pull/36#issuec...

Very cool stuff!

Yes, this is a very traditional formulation of this class of alias problems, and there has been a lot of work in the past 10 years on expressing and computing aliasing-like systems in Datalog.

I don't know about Rust, but the cost of that abstraction elsewhere is pretty good. Datalog will handily beat basic implementations, and those that most researchers have time to build. However, very well engineered purpose-built solvers still beat Datalog engines by a large factor (again, there is no theoretical reason this should be true, but it is), so production compilers still currently eat the complexity of building alias solvers. It would be nice if one day that were not true.

> However, very well engineered purpose built solvers still beat datalog engines by a large factor (again, no theoretical reason this should be true, but it is)

Sure there is a reason: Many applications, alias analysis in particular, only use a subset of Datalog. Subsetting languages very often makes for easier and more efficient implementation. If you know that there is no negation in your alias problems, your special purpose solver can be much more efficient than a Datalog solver that must be prepared to deal with negation.

Of course you might wish for a generic Datalog solver that you could run in a special mode where you promise that it will not encounter negation in the problem you give it. But that is then no longer a Datalog solver but a "well engineered purpose built solver".

As for the "built in" part, anything that is not built in must cross an API boundary and usually translate between different data representations. That is a theoretical reason for being slower.

> But that is then no longer a Datalog solver but a "well engineered purpose built solver".

I don't really agree. Specialized implementations for instances of a problem that can be solved more efficiently than the general case are a pretty fundamental part of any mature solver (GMP is probably the most well-known example). This includes even cases where the user asks for the specialized algorithm and the solver checks that you indeed have an instance of it, rather than the specialized subproblem being automatically detected by the solver (there are many examples, but I always think of optimizations for read-only transactions in databases--generally you have to declare the transaction read-only up front).

> As for the "built in" part, anything that is not built in must cross an API boundary and usually translate between different data representations. That is a theoretical reason for being slower.

The abstractions required for getting solvers to perform well on problems like this are usually sufficiently far removed from the representation you'd have had if you didn't have to solve the problem that (1) transforming it is only a small part of the overall time to solve, and (2) it is often a cost you already have to pay to some extent.

FWIW: None of this is a theoretical reason. This is no different than the problems any other compiler/JIT faces.

What you've offered are practical reasons it hasn't happened yet, but not theoretical reasons.

"System A must do more work than system B" seems a pretty theoretical reason why, in the limit (after sinking an arbitrary number of engineer hours in each), system A must be slower than system B. But your interpretations of the word "theoretical" may differ.

I think the point is that there is no clear reason System A must do more work than System B. Serious Datalog environments will identify strictly monotonic programs and execute them more efficiently (e.g. with racy asynchrony); they will identify linear recursion and partition work accordingly. They could identify programs corresponding to DannyBee's hypothetical purpose-built industrial winners and execute them using the exact same techniques, but apparently they don't.

I suspect the point of disagreement is

> Of course you might wish for a generic Datalog solver that you could run in a special mode where you promise that it will not encounter negation in the problem you give it. But that is then no longer a Datalog solver but a "well engineered purpose built solver".

which I don't think is correct. A Datalog solver with the property that it can run faster on certain classes of input is great; it is not "no longer a Datalog solver".

> Serious Datalog environments will identify strictly monotonic programs [...]

Yes. And that is extra work that must be done. Admittedly, it probably isn't the "large factor" mentioned by DannyBee.

> A Datalog solver with the property that it can run faster on certain classes of input is great; it is not "no longer a Datalog solver".

I was talking of a special, user-requested mode in which the user guarantees that the input program they are interested in will not contain certain constructs, and the solver can reject programs that do contain them. A "Datalog" solver that rejects programs containing, say, negation, does not handle the full Datalog language and is therefore no longer a Datalog solver.

As an analogy, Clang is a C++ compiler, but passing it the "-xc" flag causes it to reject valid C++ programs, so then it's no longer a C++ compiler (although it is still the same program!).

A few years ago as part of my MS program, I used Protobufs and Datalog to try to write program analyses over Java source code. I ended up implementing a simple alias analysis using this framework, and then drafting a paper--essentially a tutorial--about the design of analyses in this style; alias analysis was my detailed example.


Nice! It's always good to see Datalog in action. In my experience, Datalog is often more convenient and more readable than SQL.

As the post mentions, the current ergonomics ("a wall of definitions") are indeed the main shortcoming of the present approach, which is syntactically far removed from Datalog.

My suggestion towards an automated conversion is to consider Prolog: Syntactically, a Datalog clause is a valid Prolog term. In fact, Datalog is a syntactic subset of Prolog. Therefore, it can be easily parsed with Prolog (using the predicate read/1), and then inspected via built-in mechanisms. You may be able to write a Prolog program that converts proper Datalog definitions to such Rust snippets.

I think it would be better if we figured out how to do embedded Datalog in other languages. E.g., I really liked the research here https://github.com/rntz/datafun, where we start with "Datalog = relational algebra + fixpoints" and then try to figure out how far we can take this in a language where not everything is a table of facts :)

schelog by Dorai Sitaram https://ds26gte.github.io/schelog/index.html is a classic version of a functional/declarative hybrid as a purely operational artifact. apparently it's still being updated, but I remember it from many years ago (early 90s?).

it uses the ability to return multiple times from a continuation to allow sets of results (streams or solution trees) to be expressed directly in scheme and intermixed freely with the functional subset.

a 'nice' mixture of declarative and procedural would be really slick. i should go back and look at mozart/oz again.

I'll definitely see if my limited ability to deal with formalisms lets me understand the type system proposed in Datafun.

"Streams of solution trees" kind of reminds me of µKanren [1]. When I was using Clojure roughly 3 years ago, Kanren was a quite popular embedded logic language.

W.r.t. Datafun, I learned about it from a talk its creator gave at Strange Loop last year [2]; to me it seemed quite approachable :)

[1] http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf

[2] https://www.youtube.com/watch?v=gC295d3V9gE

Since Datalog is just a subset of Prolog, I want to link a very good book about modern Prolog and its applications [1].

[1]. https://www.metalevel.at/prolog

While it is possible to view Datalog as a subset of Prolog, they are really very different languages. In particular, Datalog implementations generally use forward rather than backward chaining evaluation. As such, programs you might write in Datalog might diverge in a naïve Prolog implementation. Depending on what variant and extensions of Datalog you are using, a program that you could write in the subset of Prolog that matches Datalog might similarly diverge in a Datalog engine.

So really linking to a text on Prolog is not really relevant to a discussion of the merits of Datalog.
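To illustrate the evaluation-order difference (my own sketch, not from either system): the left-recursive program below sends a naive Prolog backward chainer into an infinite loop (`path/2` calls itself before consuming any input), while a forward-chaining Datalog engine terminates, because the set of derivable facts over a finite domain is finite. The sketch uses semi-naive evaluation, the standard forward-chaining optimization of joining only the facts that are new in each round.

```python
# Program:
#   path(X, Z) :- edge(X, Z).
#   path(X, Z) :- path(X, Y), edge(Y, Z).
# A naive Prolog query ?- path(a, b). recurses on path/2 forever;
# forward chaining below terminates even on cyclic edge relations.

def semi_naive_path(edges):
    path = set(edges)   # facts from the non-recursive rule
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        # join only the new facts against edge/2 (semi-naive evaluation)
        derived = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = derived - path  # keep only genuinely new facts
        path |= delta
    return path
```

Note that on a cyclic graph the loop still halts: once every derivable `path` fact has been seen, `delta` is empty.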

Prolog and Datalog also have a lot in common: In addition to Datalog being a syntactic subset of Prolog, the semantics of Prolog and Datalog, and hence the solutions you obtain, are also identical if you enable SLG resolution in your Prolog system and run a Datalog program with it.

SLG resolution (also called "tabling" in Prolog) prevents both positive and negative loops, and always terminates for programs that have the bounded term-size property, as is the case for all Datalog programs. XSB Prolog is a notable example of a Prolog system that supports this mode, which you can enable with a table/1 directive such as `:- table path/2.`


Therefore, you can use a Prolog system such as XSB Prolog also to learn Datalog, and then seamlessly move to Prolog. Conversely, if you know Prolog, then learning Datalog is very easy. Another very nice system for learning Datalog is DES, which you can run in SICStus Prolog and other systems.


Various extensions such as aggregate functions and object-oriented programming (which are mentioned in the Wikipedia entry for Datalog) are also applicable to both Prolog and Datalog with identical or closely related semantics.

Yes, I'm well aware of tabling, which is why I said "naïve Prolog implementation". I'm also aware of using magic sets and demand transformations to allow writing Prolog-like relations in Datalog.

I also work on an industrial strength Datalog implementation, and I can tell you unequivocally there is very little relationship between how a real Datalog engine is implemented and how a Prolog system is implemented.

And no, learning Prolog will not help you learn Datalog. Unless you're going to always use tabling or demand transformations, the difference in evaluation model requires writing in a very different style.

Yes, such implementation differences definitely exist!

Whether one emphasizes the differences or commonalities of Prolog and Datalog may also depend on commercial interests: For vendors of Datalog systems, it may be advantageous to stress the differences. For Prolog vendors, it may conversely be advantageous to stress their commonalities.

Personally, I learned Prolog before Datalog, and it was easy for me to learn Datalog because I was already familiar with Prolog. The reason may be that I try to write as declaratively as possible when programming in Prolog, and that carried over well to Datalog too. More recently, I have also used tabling in many Prolog programs, and often obtained very general programs with good termination properties, similar to Datalog definitions.

Awesome. I look forward to browsing its code.

I implemented a Datalog engine in Java a while back. Looking back on it now there are a bunch of things I might have done differently, but I've never tried to implement anything like it before.

I remember that doing the stratified negation was quite complex, and I won't be able to explain it off the top of my head anymore.

Anyway, here's my version: https://github.com/wernsey/Jatalog

I may actually have a use for this. Thanks for releasing it!

BTW the use of functional maps, such as those in [0], might have advantages over things like StackMap. Of course it would require some restructuring of the code.

[0] https://github.com/slburson/fset-java

Thank you. I'll take a look.

In retrospect I think there might be several ways to improve performance by using other data structures to index the facts, but I haven't had the time to work on it lately. There is a pull request from someone that I'm still looking at, which might just spur me on to think about it some more.

If you want a Go implementation, look at OPA https://github.com/open-policy-agent/opa - designed for authorization but quite general.

What would be some advantages of this design over the µKanren design [0] which uses a search monad [1] and expresses unification and relations directly as plain values and functions?

[0] http://webyrd.net/scheme-2013/papers/HemannMuKanren2013.pdf

[1] http://homes.soic.indiana.edu/ccshan/logicprog/LogicT-icfp20...

Those papers and systems deal with Prolog-like languages, not Datalog-like ones. These are different languages, even though there is a common set of similar-looking programs with similar-ish declarative semantics but completely different procedural semantics. In particular, it's easy to write programs that easily compute the desired results when run on a Datalog system, but that run into an infinite loop when run on a Prolog system.

Also, the solver here is special-cased to treat only a certain subset of Datalog that is much simpler than the full language.

it looks like it doesn't handle negation. one of those unfortunate but interesting features of the space of recursive declarative languages.

if you allow something to be true only if something else is not true, your solution space is no longer monotonic, meaning that a result you thought was part of the solution can later be retracted. if there is a cycle (trivially "a if not a"), then you may never arrive at an answer. that might seem simple to detect, but the cycle can exist through an arbitrary number of rules and data dependencies, so in the limit it is undecidable.

ideally, as in this solver, you could keep adding terms you can show to be correct from the data and rules at hand until they and all their consequences have been found, at which point you can stop (a fixed point). but negation screws up this plan.

there are various restrictions on negation, solver heuristics, and partial conservative proofs of termination that can be applied. but this is where your beautiful simple declarative world starts getting sad and messy.

[edit: I'd like to leave this because I wish I understood the problem with negation earlier, but I just skimmed and missed stratification, Frank thoughtfully explains below]

I'm drinking at the moment, so maybe messy but not sad.

The engine supports negation in the standard way: stratification. You can call 'from_antijoin' with a static relation for its second argument but not a variable. It is not discussed in the post because it is a bit of a diversion from the intended message (and because antijoin is computationally not much different from a filter).
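To sketch what stratification buys you (an illustrative example in Python with made-up predicate names, not the engine's API): negation is only permitted against predicates from a *lower* stratum, whose facts are fully computed before the negating rule runs, so "not p(X)" degenerates into a set-membership test, i.e. a filter.

```python
# Stratum 0 (purely monotonic, computed to fixpoint first):
#   reachable(X) :- edge(root, X).
#   reachable(Y) :- reachable(X), edge(X, Y).
# Stratum 1 (negation only over the completed lower stratum):
#   unreachable(X) :- node(X), not reachable(X).

def stratum0_reachable(root, edges):
    reachable = {y for (x, y) in edges if x == root}
    changed = True
    while changed:
        new = {y for x in reachable for (x2, y) in edges if x2 == x}
        changed = not (new <= reachable)
        reachable |= new
    return reachable

def stratum1_unreachable(nodes, reachable):
    # "not reachable(X)" is just an antijoin/filter against a finished set
    return {x for x in nodes if x not in reachable}
```

Because stratum 0 is complete before stratum 1 starts, no fact derived with negation can ever be retracted, which restores the monotone fixpoint story within each stratum.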

If you would like a more general compute model with non-monotonic iteration, I recommend differential dataflow!
