This language of matching brackets is also called the Dyck language and its syntactic monoid is the bicyclic semigroup, which is an interesting example of an inverse semigroup. Edward Kmett gave a talk on programming with these sorts of algebraic structures that some might find illuminating: https://www.youtube.com/watch?v=HGi5AxmQUwU
Inverse semigroups are used in the geometry of interaction to give a parallelizable interpretation of lambda calculus. I don't have a great introductory reference. Here's some people who have implemented a system based on the theory: https://www.cambridge.org/core/journals/mathematical-structu...
Sorry for the citation dump, I'm short on time and I thought they would be interesting to those who found this submission interesting.
The Dyck language I was already aware of; I linked it in the previous post in the series. But the bicyclic semigroup was new to me. Yes, the bones of it are very similar to my "stack monoid"; if you erase the actual element values of the stack contents and just count the lengths, it's clearly the same thing. No doubt there's some theory that justifies adding the elements back in as well.
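For concreteness, the counts-only view can be written down directly (a minimal Haskell sketch, with names of my own choosing rather than anything from the post): each chunk of brackets reduces to a pair of counts, and merging cancels the left chunk's unmatched pushes against the right chunk's unmatched pops.

    -- Sketch only: illustrative names, not the post's formulation.
    -- A chunk reduces to (unmatched ']' on the left, unmatched '[' on the right).
    data Bicyclic = Bicyclic Int Int deriving (Eq, Show)

    instance Semigroup Bicyclic where
      Bicyclic a b <> Bicyclic c d =
        let m = min b c  -- left's pushes cancel against right's pops
        in  Bicyclic (a + c - m) (b + d - m)

    instance Monoid Bicyclic where
      mempty = Bicyclic 0 0

    fromBracket :: Char -> Bicyclic
    fromBracket '[' = Bicyclic 0 1
    fromBracket ']' = Bicyclic 1 0
    fromBracket _   = mempty

    -- A string is balanced iff its chunks fold to the identity.
    balanced :: String -> Bool
    balanced = (== mempty) . foldMap fromBracket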
I'll have to check out the Kmett talk - from the description it sounds very interesting.
Although I didn't know it by that name, I have seen the bicyclic semigroup in competitive programming many times before (usually any problem involving intervals/brackets and segment trees).
For example, in this problem https://codeforces.com/contest/1285/problem/E, one solution is to parse the string and count the number of top-level parenthesis groups after small modifications. Say the original string is "(()())()". Then "(()())" has 1, "(())()" has 2, and "()()()" has 3.
This is solved with the following monoid to make incremental reparsing fast:
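Roughly like this (a Haskell sketch of one monoid that fits; not necessarily the exact one from the actual solution): each segment keeps its net depth change, the minimum prefix depth it reaches just after a ')', and how many ')' positions reach that minimum. For a balanced string the minimum at the root is 0 and its count is exactly the number of top-level groups, and a segment tree over these values makes reparsing after a point update O(log n).

    -- Sketch/reconstruction, not the contest code.
    data Seg = Seg { delta :: Int, minDepth :: Int, minCount :: Int }

    inf :: Int
    inf = 10 ^ (9 :: Int)  -- sentinel for "no ')' in this segment"

    instance Semigroup Seg where
      Seg d1 m1 c1 <> Seg d2 m2 c2 = Seg (d1 + d2) m c
        where
          m = min m1 (d1 + m2)  -- right's depths are shifted by left's delta
          c = (if m1 == m then c1 else 0) + (if d1 + m2 == m then c2 else 0)

    instance Monoid Seg where
      mempty = Seg 0 inf 0

    fromParen :: Char -> Seg
    fromParen '(' = Seg 1 inf 0
    fromParen ')' = Seg (-1) (-1) 1
    fromParen _   = mempty

    topLevelGroups :: String -> Int
    topLevelGroups = minCount . foldMap fromParen
    -- topLevelGroups "(()())()" == 2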
Monoid-cached trees are extremely popular in competitive programming, but never by that name. It's always referred to as a "segment tree" without using any mathematical terminology more advanced than "associativity".
Due to its popularity, all kinds of crazy monoid states have already been explored (though you need to squint a bit to see them due to the terminology mismatch): https://codeforces.com/blog/entry/15890. I've never seen a stack used, though. You almost always want to compress the state as much as possible, and the problems I've seen only needed stack depth (aka the bicyclic semigroup), not contents. Requiring O(N) to merge also severely limits its use cases.
Anyway, not sure how relevant this is to you since the use of monoids in data structures and for parallelism is slightly different (binary merges all the way down versus stopping at some chunk size).
> Automatically generating monoids such as the stack monoid from their corresponding simple sequential programs seems like an extremely promising area for research.
I was interested in this problem a while back and the papers I remembered being useful were:
The motivating example is whether you can generate a mergesort (where the monoid operation is merging two sorted lists) given an implementation of an insertion sort (where the sequential program inserts elements one by one into a sorted list).
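To make the two sides of that concrete (my own sketch, not code from the papers):

    -- The sequential program: insert elements one by one into a sorted list.
    insert :: Ord a => a -> [a] -> [a]
    insert x [] = [x]
    insert x (y : ys)
      | x <= y    = x : y : ys
      | otherwise = y : insert x ys

    insertionSort :: Ord a => [a] -> [a]
    insertionSort = foldr insert []

    -- The hoped-for monoid: merging two sorted lists (associative, unit []).
    merge :: Ord a => [a] -> [a] -> [a]
    merge [] ys = ys
    merge xs [] = xs
    merge (x : xs) (y : ys)
      | x <= y    = x : merge xs (y : ys)
      | otherwise = y : merge (x : xs) ys

    -- Lift each element to a singleton sorted list and combine.
    mergeSort :: Ord a => [a] -> [a]
    mergeSort = foldr merge [] . map (: [])

Combining the singletons in a balanced tree rather than a linear fold is what recovers the familiar O(n log n) mergesort, and associativity is exactly what licenses that rearrangement.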
I remember being disappointed by the answer after reading these papers but in the citation graph there are a bunch of relatively recent program synthesis papers. Maybe there's something better now.
Thanks for the reference, it's definitely relevant. I took a pass over the paper somewhere between skimming and reading, and, while it's definitely in the ballpark of deriving parallel programs from sequential, it's also pretty clear their source language can't express the stack monoid problem because it lacks a "cons" operation; basically it seems to be for various fancy reduction problems where it consumes a sequence and spits out a scalar. On a quick read, I have no idea whether that's a fundamental limitation or something that could be extended.
I think examples like generating mergesort from insertion sort are maybe trying to be too clever. I mostly just want something that lets you slice up the problem into chunks, process each chunk in parallel, then combine them back in ways that don't kill performance on modern GPU hardware.
> I am even more convinced than before that efficient parsing is possible on GPU.
IMO, the place to start looking for GPU-friendly parsing algorithms is Valiant's algorithm, which is asymptotically the fastest known general CFG parsing algorithm, and is implemented in terms of Boolean matrix multiplication.
The intuition is that bottom-up parsing is basically a transitive closure computation, which is essentially solving a system of linear equations over the Boolean semiring, which can be sped up using Strassen's algorithm.
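Concretely, the product and closure in question look something like this (a toy sketch, nothing GPU-specific or tuned):

    import Data.List (transpose)

    -- Adjacency matrices over the Boolean semiring: "+" is (||), "*" is (&&).
    type BMat = [[Bool]]

    -- Boolean matrix product: entry (i,j) says whether j is reachable
    -- from i in exactly two hops, one through each factor.
    bmul :: BMat -> BMat -> BMat
    bmul a b = [[or (zipWith (&&) row col) | col <- transpose b] | row <- a]

    -- Reflexive-transitive closure by repeated squaring: O(log n)
    -- squarings, each a multiply that fast matmul algorithms apply to.
    closure :: BMat -> BMat
    closure a = go (zipWith (zipWith (||)) a ident)
      where
        n     = length a
        ident = [[i == j | j <- [1 .. n]] | i <- [1 .. n]]
        go m  = let m2 = bmul m m in if m2 == m then m else go m2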
A bit of Googling suggests that people have looked at optimising Boolean matrix multiplication on GPUs, but I have no idea what the state of the art here is.
A note: I just updated the post with a bunch of related work that people have sent me since I published the first draft. It also includes a bit more intuition behind the monoid.
How do you get started with learning GPU programming? I mean that in a pedagogical way, like what is the best educational way forward. Build stupid stuff in shader toys?
It's a good question, and it's been suggested I write a GPU compute 101. I did do a talk at Jane Street earlier in the year called "A taste of GPU compute" which has some relevant resources (a Google search on that term will give you a number of links).
I think there are probably a number of on-ramps. One easy way to get started is shadertoy, which if nothing else should give familiarity with GLSL syntax and intuition for performance in the simplest case: each "thread" computes something (a pixel value) independently of the others. You'll quickly run into limitations though, as it can't really do any of the more advanced compute stuff. I think as WebGPU becomes real, an analogous tool that unlocks compute kernels (aka compute shaders) could be very powerful for pedagogy.
I think most people that do compute on GPU use CUDA, as it's the only really practical toolchain for it. That has a large number of high quality open source libraries, which tend to be well-documented and have good research papers behind them. You can start by using these libraries, then digging deeper to see how they're implemented.
As I've been going on about, I believe this space is ripe for major growth in the next decade or so. As a rough guide, if you can make your algorithm fit the GPU compute model nicely, you'll get about 10x the compute per dollar, which is effectively the same as compute per watt. Why leave such performance on the table? The answer is that programming GPUs is just too hard. In certain areas, including machine learning, an investment of research has partially fixed that problem. But in others there is fruit at medium height, ripe for the picking. And in some other areas, you'll need a tall ladder, of which I believe a solid understanding of monoids is but one rung.
Thanks, it was your work that kindled my interest in programming the GPU. The fact that most vector graphics are done by CPU libraries made me realize it's still a very underdeveloped area. The hard thing about it is also what makes it interesting: it's not just a different programming language but a different architecture with other runtime characteristics. Sort of the closest you can get to owning a quantum computer. Anyway, I'll learn some math and CUDA, and buy the book if you find the time :)
In a sense, GPU programming is similar to cellular automata. Each cell is an invocation on the GPU, and you can assume an unlimited number of these invocations. You can freeze cells indefinitely or temporarily, but you cannot group them. You can assume you know the identifier number of each cell, and the number of cells for each program. The hard part is to generalize algorithms so that the only control structures are those based on freezing, identifying, and counting cells. Cells can communicate information to neighboring cells, and to any arbitrary cell, but the latter is bad for performance.
Most universities don't really teach parallel algorithm development in any way, despite core counts increasing in both the CPU and GPU worlds.
Current languages like GLSL and CUDA are very abstracted, in the sense that they don't really tell you why one approach is better or worse than another. To me, if you can create a general cellular-automaton-like solution, then that logic can be converted and is generally fine performance-wise for SIMD-oriented programming.
About the stack/monoid structure, there was a nice talk by Conal Elliott recently describing how it can be used to implement a compiler [0]. He also mentions parallel computations at some point.
I may be extrapolating too much, but I think Steele's team working on Fortress was also applying monoid-like thinking to parallelization. He gave a talk on how to delinearize problems into partially solved subproblems to be reconciled later.
A monoid is a mathematical structure which captures associativity (a+(b+c) = (a+b)+c) and has an identity element (a + 0 = 0 + a = a).
They're useful because knowing them allows for identifying and mechanically applying transformations to better leverage parallelism and easily implement incremental computations†.
If you've been programming for a while, I can guarantee you already intuitively understand the concept. Giving it a name allows you to automatically recognize and notice that you've already solved this problem before in another guise.
Alternatively, having named the concept, it's now chunked into your toolkit for use whenever applicable. Structuring your solution so you know it's a monoid allows leveraging all those benefits almost for free.
†With divide and conquer, add memoization and pay attention to the order of subproblems so you can reuse earlier work, and we're nearly at dynamic programming. Specify a bit more structure to get semirings, and a whole heap of disparate-seeming machine learning algorithms collapse together. But now I have gone too far afield.
A monoid is a concept from math: a set of objects (like the integers) with a binary operation (like addition or multiplication over integers). The operation is associative (like addition, where the grouping of a chain of additions doesn't matter; you get the same sum). And there's an identity element (for addition this would be 0).
For CS there are some nice results where the operation can be automatically parallelized. Useful for things like MapReduce if you need to run things at large scale, or in this case for distributing the work over the many shader units of a GPU.
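As a toy illustration of why associativity is the load-bearing property (a Haskell sketch, not tied to MapReduce or any GPU API):

    import Data.List (foldl')
    import Data.Monoid (Sum (..))

    -- Split the input into fixed-size chunks.
    chunks :: Int -> [a] -> [[a]]
    chunks _ [] = []
    chunks k xs = let (h, t) = splitAt k xs in h : chunks k t

    -- Fold each chunk independently (each could run on its own worker),
    -- then combine the partial results. Associativity plus the identity
    -- element guarantee this equals the plain left-to-right fold.
    chunkedFold :: Monoid m => Int -> [m] -> m
    chunkedFold k = mconcat . map (foldl' (<>) mempty) . chunks k

    -- chunkedFold 1000 (map Sum [1 .. 10 ^ 6]) == Sum (sum [1 .. 10 ^ 6 :: Int])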
So a bunch of people are giving you good definitions. Here's an article that does FizzBuzz with monoids. I think this is the first thing I saw that actually cemented them into my mind. Probably because it takes a problem-to-solution teaching method. (Unfortunately looks like the original page is dead.)
I just noticed a common pattern in a lot of these answers:
* First, give a definition
* then some basic examples to clarify the concept
* then some advanced usage to demonstrate the advantages.
I tend to do it myself, maybe because a math concept has a clear definition and math courses do it too. However, it scares away a lot of people, including myself when I'm on the receiving end.
A monoid is a set S with an associative binary operation and a unit. For instance, strings with string concatenation and the empty string form a monoid.
Functions form a monoid as well. Function composition is associative and id is the unit.
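In Haskell this monoid is packaged up as Endo; a tiny sketch:

    import Data.Monoid (Endo (..))

    -- Function composition as a monoid: (<>) is (.), mempty is id.
    double, inc :: Endo Int
    double = Endo (* 2)
    inc    = Endo (+ 1)

    -- appEndo (double <> inc) 10 == 22   (double . inc)
    -- appEndo mempty 10        == 10     (id)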
From a programming perspective, it is a way to generalize adding two things together, where the order of the operands might matter (or might not!).
+ is a monoid operation. It adds two numbers together: (3+3)+4 = 3+(3+4)
But concatenation is also a monoid:
"a" + ("b" + "c") = ("a" + "b") + "c".
Monoids have an empty thing you can add to anything.
E.g. 0 for + and "" for concatenation.
In Haskell, monoids are abstracted out into a typeclass so that you can write code that works on any monoid; for example, folding a list of monoid elements down to a single element.
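For instance (a small sketch):

    import Data.Monoid (Sum (..))

    -- The same generic fold works for any Monoid instance.
    total :: Int
    total = getSum (mconcat (map Sum [1, 2, 3, 4]))  -- 10

    word :: String
    word = mconcat ["mo", "no", "id"]                -- "monoid"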
To relate it to something you're more likely to already know: it's like a group (in the mathematical sense), except that it doesn't necessarily have inverses.
The paper it mentions and links (The Early Development of the Algebraic Theory of Semigroups) is freely available and has a bunch of discussion of terminology.
Stacks are used in parsing algorithms. In order to create a parallelized parsing algorithm the author has found this monoid representing stacks and stack operations (specifically push and pop). If this pans out, the result is a potential tool in running parsing algorithms in parallel (specifically on GPUs in their desired use-case).
Thank you. So am I correct that he has created a type (for lack of a better term) representing a number of pushes and pops, and that the monoid is the algebraic construct that lets us combine those types? So, faced with a string of square braces representing pushes and pops like in the example, we can divide it into chunks, count pushes and pops (also recording indices?) in parallel, and then combine them?
Yes, that's basically it. There are more details, like how I think this can efficiently be implemented on GPU hardware (it's not clear at all that this is the case), but you and GP have captured the basic motivation and idea.
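In toy form it looks something like this (a Haskell sketch, not the GPU formulation): a chunk of input reduces to "pop n items, then push these values".

    -- Sketch: my paraphrase, with illustrative names.
    data StackM a = StackM Int [a] deriving (Eq, Show)

    instance Semigroup (StackM a) where
      StackM n1 s1 <> StackM n2 s2 =
        let k = min n2 (length s1)  -- right's pops satisfied by left's pushes
        in  StackM (n1 + n2 - k) (take (length s1 - k) s1 ++ s2)

    instance Monoid (StackM a) where
      mempty = StackM 0 []

    -- '[' pushes its own index, ']' pops; chunks reduce independently
    -- and merge afterwards. Erasing the list down to its length gives
    -- back the bicyclic semigroup mentioned upthread.
    fromBracket :: Int -> Char -> StackM Int
    fromBracket i '[' = StackM 0 [i]
    fromBracket _ ']' = StackM 1 []
    fromBracket _ _   = mempty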
An important example from CS would be the semigroup action on the set of states of a deterministic finite automaton.
The action takes a state from Q and a symbol from E, and returns a new state, i.e. f: Q x E -> Q.
In the case where you have some "no op" action, the semigroup action is actually a monoid action.
This all corresponds to the DFA moving through states as it "eats" the string.
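A tiny sketch of this in Haskell (a toy parity DFA of my own):

    import Data.Monoid (Dual (..), Endo (..))

    -- States of a small DFA tracking the parity of '1's seen so far.
    data Q = Even | Odd deriving (Eq, Show)

    -- The action: each input symbol becomes a state-transition function.
    step :: Char -> Endo Q
    step '1' = Endo (\q -> if q == Even then Odd else Even)
    step _   = mempty  -- the "no op" symbol that upgrades it to a monoid action

    -- Eating the string is a monoid fold over transition functions; Dual
    -- makes composition apply left-to-right, in reading order. Chunks of
    -- input can be composed in parallel before applying to a start state.
    run :: String -> Q
    run s = appEndo (getDual (foldMap (Dual . step) s)) Even
    -- run "10110" == Odd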
Yupyup. For those of us who don't just automatically know what a semigroup action is, Dan Piponi's blog on using monoids to do incremental regular expression matching is probably a good read: http://blog.sigfpe.com/2009/01/fast-incremental-regular-expr...
Raph, I think some of the order-independent transparency folks (Bavoil, Wyman, Lefohn, Salvi, etc.) have had to do this bounded k-stack plus overflow thing. Might be a good "hmm, how did they do it" for comparison.
Hmm. In the end, maybe transparency is a bad example, because people immediately jumped to approximating it instead. Wyman's review paper though is a good read regardless:
The GPU / accelerator BVH traversals are perhaps more applicable (e.g., https://www.embree.org/papers/2019-HPG-ShortStack.pdf) but they also have a different "I can restart my computation" property that say JSON parsing wouldn't have (though maybe if you're willing to do a lot of restarts once you parse the inner portions and push them onto a heap...).
The bell it rang for me was [1], which has come up in discussions with Patrick Walton about maintaining a sort order for coarse rasterization in 2D vector rendering. Taking another look at that, it seems like they propose both bounded k-stacks and linked lists as solutions for recording all the fragments to be combined for transparency blending, but not the hybrid of the two. I wouldn't be surprised if that's been considered.
The problem is slightly different, though, because the major focus for transparency is concurrent writing from potentially many different shaders, while in the stack case each thread can build its data structure without any concurrency or need for atomics; once an aggregate is built, it is "published" (see my prefix sum blog for details on that) and treated as immutable.
These kinds of concurrent approaches do become important for later stages in the JSON parsing process, like building hashmaps for keys in a dictionary. I'm pretty sure the whole problem is tractable and have thoughts how to solve it, but deliberately kept the scope of this blog post limited, as I figured it was challenging enough. But it's very cool to reach people who appreciate the ideas, and if you find anything else that's relevant, I'd love to hear from you.
My first thought when seeing this stack problem was that it might be possible to use the techniques presented in "Internally Deterministic Parallel Algorithms Can Be Fast"[1]. The idea is that the algorithm runs on the prefix of the input in two steps: In the first step it will figure out which items can be processed in parallel (e.g. those that don't depend on any earlier input) and in the second step it will actually calculate it. Both steps are done in parallel. There's also a slide deck[2] which explains it quite well.
I'm not quite sure how it would work on this problem, but it might be interesting to look into.
The elements of this monoid look kind of like type signatures for Forth words. I was already playing with the idea of a sort of concurrent Forth-like language where each word gets its own stack context. I'm really interested in neat models that provide partial evaluation, so I will be giving this some thought.
I had this crazy idea a while back of using ray tracing APIs to piggyback a database. Testing for a row could be done by shooting parallel rays at a structure mapped to the data.
References
https://en.wikipedia.org/wiki/Dyck_language
https://en.wikipedia.org/wiki/Bicyclic_semigroup