
The post mentions the idea that querying a database D can be understood algebraically as enumerating all morphisms Q -> D, where Q is the "classifying" database of the query, i.e. a minimal database instance that admits a single "generic" result of the query. You can use this to give a neat formulation of Datalog evaluation. A Datalog rule then corresponds to a morphism P -> H, where P is the classifying database instance of the rule body and H is the classifying database instance for matches of both body and head. For example, for the transitivity rule

  edge(x, z) :- edge(x, y), edge(y, z).
you'd take for P the database instance containing the two rows (a_1, a_2) and (a_2, a_3), and for H the instance that additionally contains (a_1, a_3). Now saying that a database D satisfies this rule means that every morphism P -> D (i.e., every match of the rule body) can be completed to a commuting diagram

  P --> D
  |    ^
  |   /
  ⌄  /
  H
where the additional map is the arrow H -> D, which corresponds to a match of both body and head.

This kind of phenomenon is known in category theory as a "lifting property", and there's rich theory around it. For example, you can show in great generality that there's always a "free" way to add data to a database D so that it satisfies the lifting property (the orthogonal reflection construction/the small object argument). Those are the theoretical underpinnings of the Datalog engine I'm sometimes working on [1], and there they allow you to prove that Datalog evaluation is also well-defined if you allow adjoining new elements during evaluation in a controlled way. I believe the author of this post is involved in the egglog project [2], which might have similar features as well.
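
As a toy illustration of that "free" completion, here is the naive fixpoint for the transitivity rule in Python (only a sketch of the idea, not how Eqlog actually evaluates rules): keep looking for matches of the body that cannot yet be completed to a match of the head, and add the missing fact.

  def transitive_closure(edges):
      # Keep adding facts until every match of the rule body extends to a
      # match of the head, i.e. until the lifting property holds.
      db = set(edges)
      changed = True
      while changed:
          changed = False
          for (x, y) in list(db):
              for (y2, z) in list(db):
                  if y == y2 and (x, z) not in db:
                      db.add((x, z))
                      changed = True
      return db

  # {(1, 2), (2, 3)} grows into {(1, 2), (2, 3), (1, 3)}.
  print(transitive_closure({(1, 2), (2, 3)}))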

[1] https://github.com/eqlog/eqlog [2] https://github.com/egraphs-good/egglog


Thank you @xlinux and @mbid. Very interesting and not something I knew much about before.

I had a look at eqlog and egglog and if I'm understanding things correctly then one possible use case is type inference and optimization. In particular, I looked at the example in [1].

I'm thinking that this could be useful in the PRQL [2] compiler, in particular for: a) inference of type restrictions on input relations and resultant output relation types, b) optimization of resultant SQL queries.

Would you be able to comment on whether that's correct?

Any links to related examples, papers, or work would be appreciated. Thanks!

1: https://egglog-python.readthedocs.io/latest/tutorials/sklear...

2: https://www.prql-lang.org/


I actually started working on Eqlog because I wanted to use it to implement a type checker. You might want to skim the posts in my series on implementing a Hindley-Milner type system using Eqlog, starting here [1]. The meat is in posts 3 - 5.

The type checker of Eqlog is mostly implemented in Eqlog itself [2]. The general idea is that your parser populates a database with syntax nodes, which are represented as `...Node` types in the Eqlog program at [2], and then you propagate type information with Datalog/Eqlog evaluation. Afterwards, you check whether the database contains certain patterns that you want to rule out, e.g. a variable that doesn't have a type [3].
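
To illustrate the shape of that computation, here's a toy propagation step in plain Python (the node and relation names are made up, and this is of course not Eqlog's actual representation):

  # The parser assigns integer IDs to syntax nodes. Node 2 is the addition
  # of the variable node 0 and the integer literal node 1.
  add_node = {2: (0, 1)}   # addition node -> (left operand, right operand)
  has_type = {1: "Int"}    # literals get their type directly from the parser

  # Datalog-style rule, run to a fixpoint: if one operand of an addition has
  # type Int, then the other operand and the addition itself are Ints too.
  changed = True
  while changed:
      changed = False
      for node, (lhs, rhs) in add_node.items():
          if "Int" in (has_type.get(lhs), has_type.get(rhs)):
              for n in (node, lhs, rhs):
                  if has_type.get(n) != "Int":
                      has_type[n] = "Int"
                      changed = True

  # Afterwards, check for patterns you want to rule out, e.g. untyped nodes.
  untyped = [n for n in (0, 1, 2) if n not in has_type]
  print(untyped)  # [] here; a non-empty list would be a type error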

There are still some unsolved problems if you're interested in writing the whole type checker in Datalog. For example, variable lookup requires quadratic memory when implemented in Datalog. I mention this and a possible solution at [4]. However, Datalog as is can probably still be useful for some subtasks during type checking. For example, the Rust compiler uses Datalog in some parts of the type checker I believe. Reach out via e.g. github to mbid@ if you'd like to discuss in more detail.

Regarding optimization you probably want to talk with somebody working with egglog; they have a dedicated Zulip [5]. I'd imagine that for prql you want to encode the algebraic rules of pipeline transformations, e.g. associativity of filter over append. Given the query AST, eqlog or egglog would give you all equivalent ways to write the query according to your rules. You'd then select the representation you estimate to be the most performant based on a score you assign to (sub)expressions.
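
A toy Python sketch of that selection step, using one made-up rule (pushing a filter below an append) and a fabricated cost model; egglog explores the rewrites via e-graphs rather than by explicit enumeration:

  # Queries as nested tuples, one rewrite rule, and a fabricated cost model.
  def rewrites(q):
      # Rule: filtering an append is the same as appending the filtered inputs.
      if isinstance(q, tuple) and q[0] == "filter" and q[1][0] == "append":
          _, (_, x, y) = q
          yield ("append", ("filter", x), ("filter", y))

  size = {"a": 1000, "b": 1000}  # hypothetical cardinalities of base relations

  def rows(q):
      # Estimated number of rows produced by a (sub)query.
      if isinstance(q, str):
          return size[q]
      if q[0] == "filter":
          return rows(q[1]) // 10     # assume a filter keeps ~10% of its input
      return rows(q[1]) + rows(q[2])  # append

  def cost(q):
      # Score a plan by the total number of rows flowing through its operators.
      if isinstance(q, str):
          return 0
      return rows(q) + sum(cost(c) for c in q[1:])

  query = ("filter", ("append", "a", "b"))
  best = min([query] + list(rewrites(query)), key=cost)
  print(best)  # ('append', ('filter', 'a'), ('filter', 'b'))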

[1] https://www.mbid.me/posts/type-checking-with-eqlog-parsing/

[2] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...

[3] https://github.com/eqlog/eqlog/blob/9efb4d3cb3d9b024d681401b...

[4] https://www.mbid.me/posts/dependent-types-for-datalog/#morph...

[5] https://egraphs.zulipchat.com


Thank you. Will try to get into this on the weekend. I'll reach out once I can ask a more informed question.


Very interesting perspective I hadn't heard before on datalog, thanks. How far does it go - can you interpret extensions of datalog (say negation or constrained existentials) in a nice categorical way, for instance? I've given this very little thought but I imagine you'd have issues with uniqueness of these "minimal" database instances, and I'm not sure what that means for these lifting properties.

(if my question even makes sense, pardon the ignorance)


If you're interested in the details, you might want to have a look at papers [1] or [2].

You can add existentials in this framework, which basically means that the lifting problems mentioned above don't need to have unique solutions. But as you say, the "minimal" databases then aren't determined uniquely up to isomorphism anymore. So the result of Datalog evaluation now depends on the order in which you apply rules.

If I recall correctly, [3] discusses a logic corresponding to accessible categories (Datalog + equality corresponds to locally presentable categories) which includes the theory of fields. The theory of fields involves the negation 0 != 1, so perhaps that might give you a nicer way to incorporate negations without stratification.

[1] https://www.mbid.me/eqlog-semantics/

[2] https://arxiv.org/abs/2205.02425

[3] Locally presentable and accessible categories, https://www.cambridge.org/core/books/locally-presentable-and...


Thanks for the references, those papers look great! Will dig into them this evening =)


For me, the main problem with most tools that render to HTML was that they don't support all math typesetting libraries that latex supports. I used to work with category theory, where it's common to use the tikz-cd library to typeset commutative diagrams. tikz-cd is based on tikz, which is usually not supported for HTML output.

But apart from math typesetting, my latex documents were usually very simple: They just used sections, paragraphs, some theorem environments and references to those, perhaps similar to what the Stacks project uses [3]. Simple latex such as this corresponds relatively directly to HTML (except for the math formulas of course). But many latex to html tools try to implement a full tex engine, which I believe means that they lower the high-level constructs to something more low level. This results in very complicated HTML documents from even simple latex input documents.

So what I would've needed was a tool that can (1) render all math that pdflatex can render, but apart from math only needs to (2) support a very limited set of other latex features. In a hacky way, (1) can be accomplished by simply using pdflatex to render each formula of a latex document in isolation to a separate pdf, then converting this pdf to svg, and then including this svg in the output HTML in the appropriate position. And (2) is simply a matter of parsing this limited subset of latex. I've prototyped a tool like that here [1]. An example output can be found here [2].
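
The core of (1) is just a couple of external tool invocations per formula. A rough Python sketch, assuming pdflatex, the standalone class and pdftocairo from poppler-utils are available (file names and the template are made up):

  import os
  import subprocess
  import tempfile

  TEMPLATE = r"""\documentclass[preview]{standalone}
  \usepackage{amsmath,tikz-cd}
  \begin{document}
  %s
  \end{document}
  """

  def formula_to_svg(latex_formula, out_svg):
      # Render a single formula with pdflatex, then convert the PDF to SVG.
      with tempfile.TemporaryDirectory() as tmp:
          tex = os.path.join(tmp, "formula.tex")
          with open(tex, "w") as f:
              f.write(TEMPLATE % latex_formula)
          subprocess.run(["pdflatex", "-interaction=nonstopmode",
                          "-output-directory", tmp, tex], check=True)
          subprocess.run(["pdftocairo", "-svg",
                          os.path.join(tmp, "formula.pdf"), out_svg], check=True)

  formula_to_svg(r"$\int_0^1 x^2 \, dx = \tfrac{1}{3}$", "formula.svg")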

Of course, SVGs are not exactly great for accessibility. But my understanding is that many blind mathematicians are very good at reading latex source code, so perhaps an SVG with alt text set to the latex source for that image is already pretty good.

[1] https://github.com/mbid/latex-to-html

[2] https://www.mbid.me/lcc-model/

[3] https://stacks.math.columbia.edu/


Tangentially, for me the stacks project is the gold standard of mathematical typography on the web. Look at this beauty: https://stacks.math.columbia.edu/tag/074J

Also check the diagrams: https://stacks.math.columbia.edu/tag/001U

If anyone can explain to me, a complete noob regarding html, how they achieve this result with html, css and whichever latex engine they use, I would be grateful. I want to make a personal webpage in this style.


It's standard MathJax that's rendered client-side. I managed to get MathJax + XyPic rendered server-side on my website, which is a lot nicer.


Oh, you misunderstand the level of my question; rephrased, how do maek wabpag with "MathJaX that's rendered client-side"? (o´▽`o)


Take a look at MathJax's website: https://www.mathjax.org/#gettingstarted

They have a link to JSBin which contains an easy example html page.


Thanks!


uMatrix tells me there are 8 external sites to grant permissions for access to resources. Definitely not a "beauty".


I don't understand what this has to do with typography.


Have you seen Typst? I have moved over from LaTeX to Typst and most if not all of your use cases are covered.

https://typst.app/


Except the main theme, which was HTML export? https://github.com/typst/typst/issues/721

Though it's in the roadmap!


If you're going to send out math as SVGs anyway, you can also just use your normal latex to PDF renderer (e.g. pdflatex) on each formula, and then convert the output PDFs to SVGs. That way, you get the same output you'd get with latex, and you can also use latex packages that aren't supported by MathJax (e.g. tikz-cd). I've prototyped a latex to html converter [1] based on that approach, but it's probably not ready for serious use. Here's an example: https://www.mbid.me/lcc-model/

[1] https://github.com/mbid/latex-to-html


> using KaTeX [...] switched to server-side rendering with MathJax

I've been meaning to look into KaTeX. Could you elaborate on why you switched away from it? KaTeX appears to support server-side rendering already, in the sense that it generates plain HTML.


> Could you elaborate on why you switched away from it?

I started using KaTeX sometime after 2015 because it promised to be fast (the fastest! [1]). I had to change the representation of a bunch of expressions because KaTeX didn't support some environments, whilst MathJax did. It was a trade-off I was willing to accept at the time.

Many years later, I started writing a personal static-site generator. I wanted comparatively lightweight pages, so rendering server-side was an option. I re-evaluated MathJax vs KaTeX again and this time I leaned towards MathJax, as speed was no longer an issue for me. It looks like KaTeX has broader support now [2].

[1] https://katex.org

[2] https://katex.org/docs/support_table.html


MathJax has better MathML support.


The rename system call is not quite atomic. From https://linux.die.net/man/2/rename:

> However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed.

A better way to think about rename is that it's given by two atomic operations in sequence: (1) an atomic link call to make the file available under the new name, and then (2) an atomic remove operation to delete the old name. Thus, it's possible to observe the state after (1) but before (2).
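
A minimal Python sketch of that mental model (ignoring the overwrite case, where rename replaces newpath atomically but link would refuse):

  import os

  def rename_as_two_steps(oldpath, newpath):
      # Conceptually what rename does. A concurrent reader can observe the
      # state after step (1) but before step (2), where both names exist.
      os.link(oldpath, newpath)  # (1) make the file visible under the new name
      os.unlink(oldpath)         # (2) remove the old name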


Dan Luu also says rename is not atomic, at the bottom of this post: https://danluu.com/file-consistency/

I've wondered ever since I read that what the first bullet there means. It can't be what you're saying here; what you're saying would apply to racing processes as well as crashes?

So then.. what does he mean when he says rename isn't atomic if there's a crash? What's the failure condition / in-between state that can occur if there's a crash?


>what does he mean when he says rename isn't atomic if there's a crash?

Not sure. One of the papers he cites [1] has this to say about rename atomicity:

>Directory operations such as rename() and link() are seemingly atomic on all file systems that use techniques like journaling or copy-on-write for consistency

So it seems to depend on the file system. But the authors claim that not just link but also rename is atomic without qualification, which appears to be false, so I'm not sure how trustworthy this is.

There's a stackoverflow comment [2] that suggests that ext4 guarantees the kind of sequential atomicity property (atomic link followed by atomic remove) I mentioned:

>Kernel implementation depends on filesystem but here's implementation of Linux ext4 filesystem: elixir.bootlin.com/linux/v5.19.3/source/fs/ext4/namei.c#L3887 – it seems to first mark the inode as dirty, then create new link and after that delete the old link. If the kernel crashes in the middle, the end result could be two links and dirty mark. I would guess (but didn't investigate if this true) that journal recovery during the next mount would fix the end result to match atomic behavior of leaving either the old or new state depending on exact crash location.

In general, I find it rather weird how little authoritative documentation there is about whether or not this property holds given how important it is.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14...

[2] https://stackoverflow.com/questions/7054844/is-rename-atomic...


> I find it rather weird how little authoritative documentation there is about whether or not this property holds given how important it is.

Well, it can't be that important then, can it? :)


Usually the atomicity I care about is with regard to newpath: I either observe no file/the previous file at newpath, or I observe the new file at newpath. This is in contrast to a copy, for example, where I might observe a partially-written file.

This is a great point though: when taking oldpath into account, the operation is not atomic. TIL
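
For that use case the usual write-to-a-temp-file-then-rename pattern gives exactly this guarantee. A Python sketch, with error handling kept minimal:

  import os
  import tempfile

  def atomic_write(path, data):
      # Readers of `path` see either the old contents or the complete new
      # contents, never a partially written file.
      directory = os.path.dirname(path) or "."
      fd, tmp = tempfile.mkstemp(dir=directory)
      try:
          with os.fdopen(fd, "wb") as f:
              f.write(data)
              f.flush()
              os.fsync(f.fileno())
          os.replace(tmp, path)  # rename(2) under the hood
      except BaseException:
          os.unlink(tmp)
          raise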


The type of atomicity I've always seen it used for is: two processes try to rename a file at the same time, only one will succeed; the winner does what it planned and then renames the file back.

I guess you shouldn't ignore the error when both processes use the same destination name, which is why folks generally tend to put their pid in the lockfile name.


Umm, two concurrent rename(2) syscalls to the same destination path will likely both succeed. One of them will win (not be overwritten), but both should succeed.

If you want atomic creation of a path, where no one else can also succeed, you want link(2), not rename(2).
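
A Python sketch of that link(2) pattern, which also matches the pid-in-the-lockfile-name convention mentioned above (the paths are made up):

  import os

  def try_acquire_lock(lockpath):
      # Write a uniquely named file, then try to link it to the shared lock
      # path. link(2) fails if the target already exists, so exactly one
      # process wins.
      tmp = "%s.%d" % (lockpath, os.getpid())
      with open(tmp, "w") as f:
          f.write(str(os.getpid()))
      try:
          os.link(tmp, lockpath)
          return True
      except FileExistsError:
          return False
      finally:
          os.unlink(tmp)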


This is also the natural solution when you're using Datalog to compute with the AST: Datalog operates on relations/tables, so associating types with AST nodes can be accomplished using a table with two columns, one for the ID representing an AST node and one for the type associated with it. I've recently written about this approach in the series starting here: https://www.mbid.me/posts/type-checking-with-eqlog-parsing/


>Pompom is an attractive implementation of an extensional (!) dependently typed language

>Pompom provides [...] a strong normalization system

How is this possible? Extensional dependent type theory has undecidable term/type equality, so there cannot be a strong normalization algorithm.


Oh, it is extensional in the sense of supporting the K axiom, actually (not in the sense of propositional identity implying definitional equality) :).


As far as I know, Pompom has decidable type checking though.


Just to add to the confusion: Monoidal categories are the appropriate notion of "monoid object" in the category of categories; the unit and associativity laws are replaced by coherent natural isomorphisms (see "categorification"). So that's where the "monoidal" comes from.
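
Spelled out, the monoid equations (ab)c = a(bc) and 1a = a = a1 are replaced by the associator and unitor isomorphisms, which are then required to satisfy the pentagon and triangle coherence axioms:

  \alpha_{A,B,C} \colon (A \otimes B) \otimes C \;\cong\; A \otimes (B \otimes C)
  \qquad
  \lambda_A \colon I \otimes A \;\cong\; A
  \qquad
  \rho_A \colon A \otimes I \;\cong\; A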


You're describing my pet peeve with research in programming languages, or more generally the use of abstractions in CS. What you're writing makes sense: Understand the structure of your problem, then think of similar problems you've seen before, and finally try to abstract away details that are irrelevant for the solutions in each case. If you're lucky, all of these problems are instances of one general one that needs to be solved only once. In your example, it's that IO and [] are both monads and thus share some structure.

What's often happening in PL theory, though, is that people start off with axioms/syntax and then mutilate the actual problem until it fits the syntax (still badly). That's how you get ridiculous stuff like "everything is an object" or "everything is a file". In PL theory, people will usually first make up some syntax/theory and then search for models. If physicists worked the same way, they'd write down random formulae and would then go out to find phenomena described by them. You'd probably read this comment on a cave wall.


Axiomatic formulations of a problem do not have the same goals as abstractions. They can appear similar on the surface, but one has to do with the laying of a foundation and the other is about problem simplification. One is about allowing great leaps in understanding after a great deal of derivation, the other about the reduction of work via problem equivalence. Criticizing someone for using an axiomatic formulation rather than a proper abstraction in programming language theory is like criticizing an architect for using pencil in laying his foundation on paper rather than the much more appropriate cement at the construction site.

You don't get "everything is an X" because of axiomatic formulation. You get it for the same sort of reason that, instead of spending ages describing the configurations of atoms in front of you, you throw all that out and call it a screen. Is everything still atoms? Yes! But we can call it a screen, and that makes things massively easier to reason about, so we do it, because we like to reason well. But there are still benefits to thinking the other way, thinking from the very foundations. They are just different benefits, which is why we choose to think of problems from more than one perspective.


I didn't mean to criticize all axiomatization but the concrete axiomatizations arrived at by PL research. In my opinion, one should start with a thorough understanding of the problem domain, and then try to come up with a syntax that allows expression of your intuition as faithfully as possible. Form follows function. Instead, I feel like PL research is often too focused on the concrete syntax considered, thus function follows form. Case in point: Building models for the syntax at hand is often an afterthought, done only to prove consistency. I'd argue that one should start with as precise an understanding of the intended models as possible, and then come up with an axiomatization (thus, syntax) for the intended semantics.

For example, the notion of elementary topos was invented because its creators wanted to capture the way Grothendieck toposes kind of behave like constructive sets. This I find very useful, and also the axiomatizations of elementary toposes. On the other hand, Martin-Löf type theory didn't have a formal semantics at first, then an erroneous one, and finally, ~20 years later, a kind of acceptable one. And its category of models is... not really interesting. Except for categories of assemblies, I don't know of a single model of ML type theory that's not also an elementary topos. And the interesting thing about assemblies is that they can be turned into objects of the associated topos... so yeah.


And yet MLTT led to Coq, Agda, Idris and Lean, while your "PL practice" approach sounds like it would lead to, well, Go.


The "everything is an X" is what I call "foundationalism". It is a disease that started with Turing and Church, except that when they started it it was not a disease. But ever since people forgot that Y and "an encoding of Y as X" is not exactly the same thing. For some purposes it might be, but not for all purposes.


Unsurprisingly, physicists get to decide what abstractions they use the same way developers decide what languages to create. They invent some notation, start describing everything they know in that notation, and if it's productive they push it further and further until it breaks.


>If physicists worked the same way, they'd write down random formulae and would then go out to find phenomena described by them.

So, String Theory?


Kind of, from what I (complete physics noob) hear.


> However, using a Hilbert Curve for sharding doesn't seem like the best approach.

Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.

EDIT: k-means is adapted directly here: https://elki-project.github.io/tutorial/same-size_k_means
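
For reference, a rough Python sketch of that idea (ordinary k-means for the centroids, then a greedy capacity-respecting assignment; this only approximates the linked ELKI algorithm):

  import numpy as np
  from sklearn.cluster import KMeans

  def same_size_labels(X, k):
      # X: (n, d) numpy array of points, k: desired number of clusters.
      n = len(X)
      cap = -(-n // k)  # ceil(n / k): maximum cluster size
      centers = KMeans(n_clusters=k, n_init=10).fit(X).cluster_centers_
      dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
      # Assign the points with the strongest preference for one centroid first.
      order = np.argsort(dist.min(axis=1) - dist.mean(axis=1))
      labels = np.full(n, -1)
      counts = np.zeros(k, dtype=int)
      for i in order:
          for c in np.argsort(dist[i]):
              if counts[c] < cap:
                  labels[i] = c
                  counts[c] += 1
                  break
      return labels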


In my experience the linked algorithm behaves quite poorly when the dataset and the number of desired clusters become large. I think the issue is that it only allows for pairwise switching between clusters, and this ends up with a lot of points still being assigned to clearly the wrong cluster. Some day I would like to try more complex neighbourhood searches with it and see if it helps.

