That remark is actually more interesting than you think. As groundbreaking as it was, algorithm W is far too slow for non-toy languages. All modern HM languages (that I know of) use some form of union-find trickery, such as the one presented in the blog post (it is also present in the resolution-by-constraints approaches employed by Haskell and Scala).
So, in fact, it's actually never algorithm W in non-toy languages. ;)
Side note: this article is originally from 2013 and is considered a must-read by any would-be hackers trying to modify the OCaml typechecker (it's cited in the documentation).
In fact, that union-find trickery comes from the same paper that presented algorithm W, where it was named algorithm J. W was known from the start to be more useful for proofs than implementation:
> As it stands, W is hardly an efficient algorithm; substitutions are applied too often. It was formulated to aid the proof of soundness. We now present a simpler algorithm J which simulates W in a precise sense.
Isn't the union-find trickery "just" standard almost-linear unification as discovered by Paterson-Wegman and Martelli-Montanari around 1976? Or is there more to it in the setting of Algorithm J?
I'm actually a bit surprised that it took so long to discover these formulations of unification. I wonder what Prolog systems were doing at the time, given the importance of efficient unification.
Paterson-Wegman and Martelli-Montanari are worst-case linear, but Algorithm J just uses the earlier almost-linear union-find approach: unification variables are represented as mutable pointers to either a) nothing, or b) another type (which may itself be the "parent" unification variable in the union-find structure). Unification is then just a recursive function over a pair of types that updates these pointers as appropriate, and often doesn't even bother with e.g. path compression.
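For the curious, here is a minimal OCaml sketch of that representation (purely illustrative, not the actual compiler code; it skips the occurs check and generalization levels):

```ocaml
(* Unification variables are mutable pointers: either unbound, or linked to
   another type (possibly another variable, giving the union-find chain). *)
type ty =
  | TVar of tvar ref
  | TArrow of ty * ty
  | TInt

and tvar = Unbound of int | Link of ty

(* Follow Link pointers until we hit a constructor or an unbound variable. *)
let rec repr = function
  | TVar ({ contents = Link t } as r) ->
      let t' = repr t in
      r := Link t';            (* optional path compression *)
      t'
  | t -> t

let rec unify t1 t2 =
  match repr t1, repr t2 with
  | TVar r1, TVar r2 when r1 == r2 -> ()
  | TVar r, t | t, TVar r -> r := Link t   (* no occurs check in this sketch *)
  | TArrow (a1, b1), TArrow (a2, b2) -> unify a1 a2; unify b1 b2
  | TInt, TInt -> ()
  | _ -> failwith "type clash"
```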
The Wikipedia article claims that W is efficient, but only for a core language without highly desirable features like recursion, polymorphism, and subtyping.
Actually, there is a rather pervasive notion of "soundness" in programming language theory. It has many flavors (because, as you hint, programming languages are quite varied). One of the simplest ways of stating it, often dubbed "progress and preservation", is: a well-typed program doesn't get stuck (i.e., it doesn't crash) and, if it terminates, returns a value of the expected type.
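For reference, the two lemmas are usually stated along these lines (standard textbook formulation, nothing language-specific):

```latex
% Progress: a closed, well-typed term is either a value or can take a step.
\text{If } \vdash e : \tau \text{, then } e \text{ is a value or } \exists e'.\ e \longrightarrow e'.
% Preservation: evaluation does not change the type.
\text{If } \vdash e : \tau \text{ and } e \longrightarrow e' \text{, then } \vdash e' : \tau.
```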
This obviously covers type safety, but also spatial memory safety, and can be extended to temporal memory safety, and even some concurrency aspects.
This has been proven for fairly large subsets of programming languages and even some full languages (like SML). Unfortunately, it doesn't hold for most mainstream ones, because programming language theory is often ... under-used ... among language designers.
https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.
Giuseppe Castagna is a very well known PL researcher, and very seasoned when it comes to gradual types and "structural" types (the static version of duck-typing), which seems like a great fit for Elixir.
ocaml-re[1] uses a derivative-style construction to lazily build a DFA. The general idea is to use something similar to Owens et al.'s DFA construction, but doing it inline with some caching, to compile lazily (and building a Thompson-like automaton for group capture). In practice, it is fairly fast, although not as optimized as the Rust crate. :)
Derivatives support several match semantics very easily (ocaml-re does longest, shortest, greedy, non-greedy, first). It doesn't handle Unicode matching though (it's possible, there is a prototype implementation, but nobody has taken the time to push it through).
Note that it's not difficult to (lazily or not) build an NFA using derivatives as well (with Antimirov's construction).
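To make the derivative idea concrete, here is a minimal Brzozowski-style matcher over plain chars (purely illustrative; it has none of ocaml-re's caching, simplification, or character-set handling):

```ocaml
type re = Empty | Eps | Char of char | Seq of re * re | Alt of re * re | Star of re

let rec nullable = function
  | Empty | Char _ -> false
  | Eps | Star _ -> true
  | Seq (a, b) -> nullable a && nullable b
  | Alt (a, b) -> nullable a || nullable b

(* Brzozowski derivative: [deriv c r] recognizes { w | c.w is in L(r) }.
   A DFA state is a regex; its transition on c is its derivative by c. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Char c' -> if c = c' then Eps else Empty
  | Seq (a, b) ->
      let d = Seq (deriv c a, b) in
      if nullable a then Alt (d, deriv c b) else d
  | Alt (a, b) -> Alt (deriv c a, deriv c b)
  | Star a as r -> Seq (deriv c a, r)

(* No simplification here, so terms grow; real implementations normalize and
   memoize the derivatives, which is exactly the lazy DFA construction. *)
let matches r s = nullable (String.fold_left (fun r c -> deriv c r) r s)
```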
Oh nice! Unicode is definitely something that's on my mind when thinking about derivatives and how to deal with them, but it sounds like ocaml-re is doing pretty well outside of Unicode. I would love to hook it up to my benchmark harness. (It isn't public yet... Hopefully soon. But it supports regexes in any language. So far I have Rust, C, C++, Python and Go. I hope to add .NET, Perl and Node at least. But this might be a cool addition too.)
If anyone wants to add this OCaml engine to the harness (or any other engine), please email me at jamslam@gmail.com and I'll give access to the repo. The only reason it isn't public yet is because I'm still working on the initial release and iterating. But it's close enough that other people could submit benchmark programs for other regex engines.
I don't think you should be worried about Unicode in particular. Although the derivation formula on paper is parameterized by a character, you don't have to compute the derivative for every character separately.
It's actually easy to compute classes of characters that have the same derivative (it's done in the linked "Regular-expression derivatives re-examined" paper, although their particular implementation favors simplicity over efficiency), and it's not even necessary when using Antimirov's partial derivatives.
Actually, the complexity of the derivation is independent of the size of the alphabet. You could even define derivation on an arbitrary semi-lattice, not necessarily a set of characters. (Or a boolean algebra if you care about negation/complementation).
The difficulty in handling Unicode has more to do with the efficiency of the automaton representation and manipulation than with turning the RE into an NFA or DFA.
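To illustrate the "classes of characters with the same derivative" point above, here is a rough sketch (simplified from what the paper does; the byte-sized alphabet and set representation are just placeholders):

```ocaml
module CSet = Set.Make (Char)

type re =
  | Empty                     (* matches nothing *)
  | Eps                       (* matches the empty string *)
  | Chars of CSet.t           (* a character class *)
  | Seq of re * re
  | Alt of re * re
  | Star of re

(* The whole alphabet; for Unicode you'd use ranges of code points instead. *)
let sigma = CSet.of_list (List.init 256 Char.chr)

let rec nullable = function
  | Empty | Chars _ -> false
  | Eps | Star _ -> true
  | Seq (a, b) -> nullable a && nullable b
  | Alt (a, b) -> nullable a || nullable b

(* Pairwise intersection of two partitions, dropping empty cells. *)
let meet p q =
  List.concat_map
    (fun s ->
       List.filter_map
         (fun t ->
            let i = CSet.inter s t in
            if CSet.is_empty i then None else Some i)
         q)
    p

(* Character classes of [r]: a partition of the alphabet such that two
   characters in the same cell have the same derivative of [r]. *)
let rec classes = function
  | Empty | Eps -> [ sigma ]
  | Chars s -> List.filter (fun c -> not (CSet.is_empty c)) [ s; CSet.diff sigma s ]
  | Seq (a, b) -> if nullable a then meet (classes a) (classes b) else classes a
  | Alt (a, b) -> meet (classes a) (classes b)
  | Star a -> classes a
```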
Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)
It has been a long time since I read the "Regular-expression derivatives re-examined" paper. Mostly the only thing I remember at this point is that I came away thinking that it would be difficult to adapt in practice for large Unicode classes. But I don't remember the details.
It is honestly very difficult for me to translate your comment here into an actionable implementation strategy. But that's probably just my inexperience with derivatives talking.
> Does there exist a regex engine I can try that uses derivatives and supports large Unicode classes and purports to be usable for others? :-)
I don't know any besides ocaml-re that Drup already linked, sorry :).
And sorry that my comment is hard to decipher. I think the core point is that the "character set" can be an abstract type from the point of view of the derivation algorithm.
So it doesn't matter how they are represented, nor "how big" a character set is.
With Antimirov's derivative (which produces an NFA), there is no constraint on this type.
With Brzozowski's derivative, you need at least the ability to intersect two character sets. So the type should implement a trait with an intersection function (in Rust syntax, `trait Intersect { fn intersect(self, other: Self) -> Self; }`). That's necessary for any implementation generating a DFA anyway.
And if you also want to deal with complementation, then a second method `fn negate(self) -> Self` is necessary.
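For what it's worth, an OCaml rendering of the same idea (names are made up, not from any actual library):

```ocaml
(* The character-set type the derivation algorithm needs. *)
module type CHARSET = sig
  type t
  val inter : t -> t -> t    (* needed for Brzozowski / DFA construction *)
  val compl : t -> t         (* only needed if you want complementation *)
  val is_empty : t -> bool
end
(* The derivative code would be a functor over CHARSET: it never looks at
   individual characters, so the size of the alphabet doesn't matter. *)
```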
Thanks! You might be right. I'm probably at a point where I'd have to actually go out and try it to understand it better.
I do wonder if there is some room for derivatives in a meta regex engine (like RE2 or the regex crate). For example, if it let you build a DFA more quickly (in practice, not necessarily in theory), then you might be able to use it for a big subset of cases. It's tricky to make that case over the lazy DFA; however, a full DFA has more optimization opportunities. For example, identifying states with very few outgoing transitions and "accelerating" them by running memchr (or memchr2 or memchr3) on those outgoing transitions instead of continuing to walk the automaton. It's really hard to do that with a lazy DFA because you don't really compute entire states up front.
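Roughly, the acceleration trick looks like this (a hypothetical helper, with `String.index_from_opt` standing in for memchr):

```ocaml
(* If a DFA state loops on itself for every byte except [leaving_byte], skip
   ahead with a byte search instead of stepping the automaton byte by byte. *)
let accelerated_step ~step ~state ~haystack ~pos ~leaving_byte =
  match String.index_from_opt haystack pos leaving_byte with
  | None -> None                                     (* stayed in [state] until the end *)
  | Some i -> Some (step state haystack.[i], i + 1)  (* resume normal stepping *)
```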
I think what you suggest is possible; derivation might even be well suited for this application. However, I can't tell whether it would be better than existing approaches.
There is some chance that it would be interesting in practice, since it seems that this application of derivatives has not been much studied, but that's highly speculative.
Having good-quality, curated regex benchmarks would be quite useful! I hope you plan on covering several features, and on supporting engines that only have partial support for them. That would make for very interesting comparisons.
It does. And more. The only thing you have to do is provide a short program that parses the description of the benchmark on stdin, and then output a list of samples that consist of the time it took to run a single iteration and the "result" of the benchmark for verification. The harness takes over from there. There's no need to have any Unicode support at all. I even have a program for benchmarking `memmem`, which is of course not a regex engine at all.
`ocaml-ctypes` currently supports "reverse-bindings" (making OCaml functions available from C) out-of-the-box and mostly takes care of the intersection you are talking about, so this already works quite well.
The only gain from emitting C code is portability to weird architectures that would be covered by C compilers but not the OCaml one, which is arguably a pretty niche use case.
I would encourage you not to use DOIs for software. They are not made for this and have limitations that make them a poor fit.
Instead, use Software Heritage https://www.softwareheritage.org/ : it provides unique identifiers and actually understands repositories, versioning, history, etc.
It also allows you to cite the software and even give proper, durable links to point to the code.
I have a general question regarding type theory. I'd like to learn more about it to be able to read and understand papers like these. From what I've read, Types and Programming Languages by Benjamin C. Pierce would be a good place to start. Is this true? I don't see any mention of linear types in it; is this something you learn by reading papers?
Linear types are a fairly new thing; you can read about them in constructivist logic (under the subset called 'linear logic', or 'resource-aware logic') or in newer type theory material under 'linear types'.
The conference is indeed a spoof, but insofar as what mathematicians call a "proof" goes, the paper contains one. Agda is a proof assistant in the spirit of the Calculus of Constructions ( https://en.wikipedia.org/wiki/Calculus_of_constructions ).
So is the joke on Computer Scientists or Mathematicians? You decide ;)
"Beware of bugs in the above code; I have only proved it correct, not tried it." --Donald Knuth
SIGBOVIK's jokes are of the kind where the premise is completely bonkers, and the rest of the development is made with the utmost rigor to highlight said bonkersitude: reductio ad absurdum.
The wide adoption of Flask as Python's backend development framework of choice makes it quite clear that software developers have a hard time picking up on April Fools' jokes.
My understanding is that the difference between fuzz testing and property testing is how the input is crafted. Both can be viewed as a pair of things: a function to generate a series of bits as input, and a way to turn these bits into the appropriate data under test.
Property testing generates these bits using a specified distribution, and that's about it. Fuzz testing generates these bits by looking at how the program is executed, and uses that feedback to try to explore all paths in the program.
Most libraries for property testing come with very convenient ways to craft the "input to data" part. Fuzz tools come with an almost magically effective way to craft interesting inputs. The two combine very well (and have been combined in several libraries).
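Concretely, the split could look something like this (a toy sketch, not any particular library's API):

```ocaml
(* The "bits in, structured data out" view: a generator consumes a stream of
   random bytes and builds a value. A property tester feeds it bytes from a
   PRNG with a chosen distribution; a fuzzer feeds it bytes picked via its
   coverage feedback. *)
type 'a gen = (unit -> int) -> 'a      (* the argument yields the next byte *)

let int_range lo hi : int gen = fun next_byte ->
  (* assumes lo <= hi *)
  let raw = (next_byte () lsl 8) lor next_byte () in
  lo + raw mod (hi - lo + 1)

let list_of (g : 'a gen) : 'a list gen = fun next_byte ->
  let len = next_byte () mod 8 in
  List.init len (fun _ -> g next_byte)
```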
Yep, but conversely, they have the same goal: produce interesting inputs to the program that might exhibit trouble.
This is why each approach can be used to help the other.
The third solution is concolic testing: use an SMT/SAT solver to flip branches. The path down to a branch imposes a formula; by negating that formula, we can pick a different branch. You then ask the SMT solver to check that there's no way to go down that branch. If it instead finds a satisfying assignment, you have a counterexample and can explore that path.
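Schematically (with a made-up `solve` function standing in for the SMT call; this is not a real binding):

```ocaml
(* [solve] returns a model (an input) if the conjunction of constraints is
   satisfiable, None otherwise. *)
type cond = Atom of string | Not of cond

let flip_last_branch ~(solve : cond list -> 'input option) path_condition =
  match List.rev path_condition with
  | [] -> None
  | last :: earlier ->
      (* keep the path prefix as-is, negate the last branch taken, and ask
         whether some input can reach the other side of that branch *)
      solve (List.rev (Not last :: earlier))
```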
Being thematically similar is not that interesting, if one works better than the others.
Coverage guided fuzzing is eating symbolic execution and concolic testing for breakfast. It isn't even close. As much as I love these more principled approaches, the strategy of throwing bits at programs and applying careful coverage metrics is just way more effective, even for cases that seem to be hand picked for SMT-based approaches to win.
> Property testing generates these bits using a specified distribution, and that's about it
I think most property testing frameworks also come with the concept of "shrinking", which is a way to walk a failing input back to try to find the "minimum requirement of failure". Though I am sure there are PT frameworks that haven't implemented this.
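A toy version of the idea, for a single integer input (not how any particular framework implements it):

```ocaml
(* [prop] returns true when the property holds; we start from a failing value
   and greedily walk it back towards a smaller value that still fails. *)
let rec shrink_int prop x =
  if x <= 0 then x
  else
    let candidates = [ x / 2; x - 1 ] in
    match List.find_opt (fun c -> not (prop c)) candidates with
    | Some smaller -> shrink_int prop smaller
    | None -> x   (* no smaller candidate fails: x is locally minimal *)
```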
If memory serves right, AFL might have some support for something like shrinking?
Of all the property based testing libraries, hypothesis has some great ideas on shrinking. (By comparison, Haskell's QuickCheck is actually pretty bad at shrinking.)
Fuzz testing doesn't mean that the program execution is monitored in any more advanced way than detecting crashes. The fancy execution path feedback stuff came about relatively recently in the history of fuzzing.