kmicinski's comments

kmicinski · 2025-11-25T17:09:23 1764090563

> In an introduction course there is no need to aim for maximum performance if parsing a 10k lines program takes less than a second.

I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.

kmicinski · 2025-11-25T04:39:10 1764045550

Respectfully, I think what you mean is that there are no books which give you the experience of hacking on LLVM for several years.

kmicinski · 2025-11-06T17:50:23 1762451423

> - I had worse performance with Soufflé than Ascent for my program for some query-planning reason that I couldn't figure out. I don't really know why; see https://github.com/souffle-lang/souffle/discussions/2557

I think the basic issue is that ADTs are simply not indexed--so to the degree that you write a query that would necessitate an index on a subtree of an ADT, you will face asymptotic blowup, as the way ADTs work will force you to scan-then-test across all ADTs (associated with that top-level tag). The issue is discussed in Section 5.2 of this paper here: https://arxiv.org/pdf/2411.14330

arxanas · 2025-11-06T18:46:59 1762454819

Ah, yes, but I think Ascent also doesn't index ADTs. In this case, based on some other information, it seems like Soufflé _can_ plan the queries better if it has profiling data. It seems like Ascent just happened to pick a better query plan in my case without the profiling data.

Thanks for the link to the paper!

kmicinski · 2025-11-06T19:04:40 1762455880

It's true that Ascent does not index ADTs either, but there are some tricks that you can use when you control the container type to get similar performance by, e.g., storing a pre-computed hash. I believe Arash, the main author of Ascent, was exploiting this trick for Rc<...> members and seeing good performance gains. It is a bit nuanced, you're right that Ascent doesn't pervasively index ADTs out of the box for sure.

kmicinski · 2025-08-12T03:15:52 1754968552

I appreciate your thoughtful criticism of the post, to my eyes everything you are saying is correct.

kmicinski · 2025-08-12T01:58:41 1754963921

Seems potentially interesting to explore what would be required to store durable continuations. Feels very related to incrementalization and provenance, as you can see materializing a continuation to disk (whatever storage backend) requiring dependence tracking to do anything other than simply snapshotting the whole program state. I am just spitballing though, not sure if anyone has actually tried this.

kmicinski · 2025-08-11T22:12:49 1754950369

I agree with you--that's a topic I will definitely cover in my blog, too. You make a good point: I know some folks who worked at big financial orgs, writing hundreds of thousands of lines of code, and never wrote general-recursive functions (only used simple recursors like foldl).

kmicinski · 2025-06-15T19:28:58 1750015738

For materialization-heavy workloads (program analysis, etc.), we often find that optimized binary join plans (e.g., profile-optimized, hand-optimized, etc.) beat worst-case optimal plans due to the ability to get better scalability (less locking) without the need to use a trie-based representation. Within the space of worst-case optimal plans, there are still lots of choices: but a bad worst-case optimal plan can often beat a bad (randomly-chosen) binary plan. And of course (the whole point of this exercise), there are some queries where every binary plan explodes and you do need WCOJ. There's also some work on making more traditional binary joins robust (https://db.in.tum.de/people/sites/birler/papers/diamond.pdf), among other interesting work (https://arxiv.org/html/2502.15181v1). Effectively parallelizing WCOJs is still an open problem as far as I am aware (at least, this is what folks working on it tell me), but there are some exciting potential directions in tackling that that several folks are working on I believe.

kmicinski · 2025-06-15T16:40:10 1750005610

No offense, but I wouldn't take Datalog 2.0's small attendance as an exemplar of Datalog's decline, even if I agree with that high-level point. Datalog 2.0 is a satellite workshop of LPNMR, a relatively-unknown European conference that was randomly held in Dallas. I myself attended Datalog 2.0 and also felt the event felt relatively sparse. I also had a paper (not my primary work, the first author is the real wizard of course :-) at the workshop. I myself saw relatively few folks in that space even attending that event--with the notable exception of some European folks (e.g., introducing the Nemo solver).

All of this is to say, I think Datalog 2.0's sparse attendance this year may be more indicative of the fact that it is a satellite workshop of an already-lesser-prestigious conference (itself not even the main event! That was ICLP!) rather than a lack of Datalog implementation excitement.

For what it's worth, none of what I'm saying is meant to rebut your high-level point that there is little novelty left in implementing raw Datalog engines. Of course I agree, the research space has moved far beyond that (arguably it did a while ago) and into more exotic problems involving things like streaming (HydroFlow), choice (Dusa), things that get closer to the general chase (e.g., Egglog's chase engine), etc. I don't think anyone disagrees that vanilla Datalog is boring, it's just that monotonic, chain-forward saturation (Horn clauses!) are a rich baseline with a well-understood engineering landscape (esp in the high-performance space) to build out more interesting theories (semirings, Z-sets, etc..).

BartjeD · 2025-06-16T06:01:47 1750053707

The USA is a hostile environment to visit, for Europeans. It's a theme all over western and northern Europe to avoid it.

If you were expecting European attendance, that would explain why it's different from the past.

It's the same in science circles, tourism and even business travel.

freilanzer · 2025-06-16T13:49:27 1750081767

As a European, I'm definitely avoiding the US. I was asked whether I wanted to visit our US branch and I declined - which was simply accepted and probably expected in advance. Almost everyone I know does the same.

kmicinski · 2025-06-16T15:56:59 1750089419

I completely agree with you, but this event was held in October, 2024--Trump's horrendous behavior wasn't an effect on that. There are other reasons folks internationally probably don't want to travel to Texas aside from Trump--but in this specific case I think it might have more to do with the fact that LPMNR is a traditionally European conference randomly held in the US.

kmicinski · on Dec 4, 2023

Yeah, there are a ton of substantively different approaches to modern Datalogs, targeting different applications.

To start off: Datalog is distinguished from traditional SQL in its focus on heavily-recursive reachability-based reasoning. With respect to expressivity, you can see Datalog as CDCL/DPLL restricted to boolean constraint propagation (i.e., Horn clauses). Operationally, you can think of this as: tight range-indexed loops which are performing insertion/deduplication into an (indexed) relation-backing data structure (a BTree/trie/etc...). In SQL, you don't know the query a-priori, so you can't just index everything--but in Datalog, you know all of the rules up-front and can generate indices for everything. This ubiquitous indexing enables the state-of-the-art work we see with Datalog in static analysis (DOOP, cclyzer), security (ddisasm), etc...

Our group targets tasks like code analysis and these big graph problems because we think they represent the most computationally-complex, hard problems that we are capable of doing. The next step here is to scale our prototypes (a handful of rules) to large, realistic systems--some potential applications of that are, e.g., raw feature extraction for binaries when you do ML over binary corpuses (which otherwise require, e.g., running IDA) on the GPU (rather than IDA on the CPU), medical reasoning (accelerating MediKanren), and (hopefully) probabilistic programming (these neuro-symbolic applications).

By contrast, I think work which takes a more traditional Databases approach (CodeQL, RDFox, ...) focus a little less on ubiquitous high-performance range-indexed insertion in a tight loop, and focus a little more on supporting robust querying and especially operating on streams. There is some very cool related work there in differential dataflow (upon which differential Datalog is built). There is a solver there named DDlog (written in Rust) which takes that approach. Our in-house experiments show that DDlog is often a constant factor slower than Souffle on GPUs, and we did not directly compare against DDlog in this paper--I expect the results would be roughly similar to Souffle.

zozbot234 · on Dec 4, 2023

> To start off: Datalog is distinguished from traditional SQL in its focus on heavily-recursive reachability-based reasoning.

This was historically true but SQL has CTE's and recursive CTE's these days, and even some extra syntactic sugar for reachability query's. And of course (given the former) "inference" and "deduction" over data are just a CREATE VIEW statement away. This is the "you know all of the rules up-front" part; a VIEW is just a query that you do know upfront. Most uses of "deductive" databases are just for querying within some in-memory database, which is not really playing in the same league as a fully general RDBMS. This is not intended to dismiss the work in OP, of course; if anything, its applicability need not be restricted to a tiny niche of software intended for specific application areas, and can be quite a bit broader than that.

kmicinski · on Dec 5, 2023

Right--but CTEs are orders-of-magnitude slower than Datalog, to the point that they are not seriously worth considering for any modern application where Datalog would be used. As you said, Datalog is less expressive than SQL's ability to create new queries in an ad-hoc way: knowing all queries within the fixed point enables much more efficient compilation.

westurner · on Dec 5, 2023

"Introduction to Datalog" re: Linked Data https://news.ycombinator.com/context?id=34808887

pyDatalog/examples/SQLAlchemy.py: https://github.com/baojie/pydatalog/blob/master/pyDatalog/ex...

GH topics > datalog: https://github.com/topics/datalog

datalog?l=rust: https://github.com/topics/datalog?l=rust ... Cozo, Crepe

Crepe: https://github.com/ekzhang/crepe :

> Crepe is a library that allows you to write declarative logic programs in Rust, with a Datalog-like syntax. It provides a procedural macro that generates efficient, safe code and interoperates seamlessly with Rust programs.

Looks like there's not yet a Python grammar for the treeedb tree-sitter: https://github.com/langston-barrett/treeedb :

> Generate Soufflé Datalog types, relations, and facts that represent ASTs from a variety of programming languages.

Looks like roxi supports n3, which adds `=>` "implies" to the Turtle lightweight RDF representation: https://github.com/pbonte/roxi

FWIW rdflib/owl-rl: https://owl-rl.readthedocs.io/en/latest/owlrl.html :

> simple forward chaining rules are used to extend (recursively) the incoming graph with all triples that the rule sets permit (ie, the “deductive closure” of the graph is computed).

ForwardChainingStore and BackwardChainingStore implementations w/ rdflib in Python: https://github.com/RDFLib/FuXi/issues/15

Fast CUDA hashmaps

Gdlog is built on CuCollections.

GPU HashMap libs to benchmark: Warpcore, CuCollections,

https://github.com/NVIDIA/cuCollections

https://github.com/NVIDIA/cccl

https://github.com/sleeepyjack/warpcore

/? Rocm HashMap

DeMoriarty/DOKsparse: https://github.com/DeMoriarty/DOKSparse

/? SIMD hashmap

Google's SwissTable: https://github.com/topics/swisstable

rust-lang/hashbrown: https://github.com/rust-lang/hashbrown

CuPy has array but not yet hashmaps, or (GPU) SIMD FWICS?

NumPy does SIMD: https://numpy.org/doc/stable/reference/simd/

google/highway: https://github.com/google/highway

xtensor-stack/xsimd: https://github.com/xtensor-stack/xsimd

GH topics > HashMap: https://github.com/topics/hashmap

kmicinski · on Dec 5, 2023

Yeah, IncA (compiling to Souffle) and Ascent (I believe Crepe is not parallel, though also a good engine) are two other relevant cites here. Apropos linked data, our group has an MPI-based engine which is built around linked facts (subsequently enabling defunctionalization, ad-hoc polymorphism, etc..), which is very reminiscent of the discussion in the first link of yours: https://arxiv.org/abs/2211.11573

westurner · on Dec 5, 2023

TIL about Approximate Reasoning.

"Approximate Reasoning for Large-Scale ABox in OWL DL Based on Neural-Symbolic Learning" (2023) > Parameter Settings of the CFR [2023 ChunfyReasoner] and NMT4RDFS [2018] in the Experiments. https://www.researchgate.net/figure/Parameter-Settings-of-th...

"Deep learning for noise-tolerant RDFS reasoning" (2018) > NMT4RDFS: http://www.semantic-web-journal.net/content/deep-learning-no... :

> This paper documents a novel approach that extends noise-tolerance in the SW to full RDFS reasoning. Our embedding technique— that is tailored for RDFS reasoning— consists of layering RDF graphs and encoding them in the form of 3D adjacency matrices where each layer layout forms a graph word. Each input graph and its entailments are then represented as sequences of graph words, and RDFS inference can be formulated as translation of these graph words sequences, achieved through neural machine translation. Our evaluation on LUBM1 synthetic dataset shows 97% validation accuracy and 87.76% on a subset of DBpedia while demonstrating a noise-tolerance unavailable with rule-based reasoners.

NMT4RDFS: https://github.com/Bassem-Makni/NMT4RDFS

...

A human-generated review article with an emphasis on standards; with citations to summarize:

"Why do we need SWRL and RIF in an OWL2 world?" [with SPARQL CONSTRUCT, SPIN, and now SHACL] https://answers.knowledgegraph.tech/t/why-do-we-need-swrl-an...

https://spinrdf.org/spin-shacl.html :

> From SPIN to SHACL In July 2017, the W3C has ratified the Shapes Constraint Language (SHACL) as an official W3C Recommendation. SHACL was strongly influenced by SPIN and can be regarded as its legitimate successor. This document explains how the two languages relate and shows how basically every SPIN feature has a direct equivalent in SHACL, while SHACL improves over the features explored by SPIN

/? Shacl datalog https://www.google.com/search?q=%22shacl%22+%22datalog%22

"Reconciling SHACL and Ontologies: Semantics and Validation via Rewriting" (2023) https://scholar.google.com/scholar?q=Reconciling+SHACL+and+O... :

> SHACL is used for expressing integrity constraints on complete data, while OWL allows inferring implicit facts from incomplete data; SHACL reasoners perform validation, while OWL reasoners do logical inference. Integrating these two tasks into one uniform approach is a relevant but challenging problem.

"Well-founded Semantics for Recursive SHACL" (2022) [and datalog] https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A169...

"SHACL Constraints with Inference Rules" (2019) https://arxiv.org/abs/1911.00598 https://scholar.google.com/scholar?cites=1685576975485159766...

Datalog > Evaluation: https://en.wikipedia.org/wiki/Datalog#Evaluation

...

VMware/ddlog: Differential datalog

> Bottom-up: DDlog starts from a set of input facts and computes all possible derived facts by following user-defined rules, in a bottom-up fashion. In contrast, top-down engines are optimized to answer individual user queries without computing all possible facts ahead of time. For example, given a Datalog program that computes pairs of connected vertices in a graph, a bottom-up engine maintains the set of all such pairs. A top-down engine, on the other hand, is triggered by a user query to determine whether a pair of vertices is connected and handles the query by searching for a derivation chain back to ground facts. The bottom-up approach is preferable in applications where all derived facts must be computed ahead of time and in applications where the cost of initial computation is amortized across a large number of queries.

From https://community.openlinksw.com/t/virtuoso-openlink-reasoni... https://github.com/openlink/virtuoso-opensource/issues/660 :

> The Virtuoso built-in (rule sets) and custom inferencing and reasoning is backward chaining, where the inferred results are materialised at query runtime. This results in fewer physical triples having to exist in the database, saving space and ultimately cost of ownership, i.e., less physical resources are required, compared to forward chaining where the inferred data is pre-generated as physical triples, requiring more physical resources for hosting the data.

FWIU it's called ShaclSail, and there's a NotifyingSail: org.eclipse.rdf4j.sail.shacl.ShaclSail: https://rdf4j.org/javadoc/3.2.0/org/eclipse/rdf4j/sail/shacl...

kmicinski · on Dec 4, 2023

Sounds awesome--feel free to get in touch with us (the authors of this paper) and share your progress. We have a similar single-node Datalog engine in Rust, it would be cool to benchmark your results against parallel Ascent (https://github.com/s-arash/ascent).

cmrdporcupine · on Dec 4, 2023

Cool, I will try to carve out some time to look at that, too!