
The Stack Monoid - raphlinus
https://raphlinus.github.io/gpu/2020/09/05/stack-monoid.html
======
bitdizzy
This language of matching brackets is also called the Dyck language and its
syntactic monoid is the bicyclic semigroup, which is an interesting example of
an inverse semigroup. Edward Kmett gave a talk on programming with these sorts
of algebraic structures that some might find illuminating:
[https://www.youtube.com/watch?v=HGi5AxmQUwU](https://www.youtube.com/watch?v=HGi5AxmQUwU)
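
For concreteness, a minimal sketch (mine, not from the talk) of the bicyclic
monoid: an element (c, o) records c unmatched closing brackets followed by o
unmatched opening ones.

    from functools import reduce

    def combine(x, y):
        # opens left over from x cancel against closes at the start of y
        c1, o1 = x
        c2, o2 = y
        m = min(o1, c2)
        return (c1 + c2 - m, o1 + o2 - m)

    identity = (0, 0)

    def of_char(ch):
        # '(' contributes one unmatched open, ')' one unmatched close
        return (0, 1) if ch == '(' else (1, 0)

    # a string is balanced (in the Dyck language) iff it reduces to identity
    assert reduce(combine, map(of_char, "(())()"), identity) == (0, 0)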

Inverse semigroups are used in the geometry of interaction to give a
parallelizable interpretation of lambda calculus. I don't have a great
introductory reference, but here are some people who have implemented a
system based on the theory:
[https://www.cambridge.org/core/journals/mathematical-structures-in-computer-science/article/abstract-machines-optimal-reduction-and-streams/60F590DA98EAC48CC6E7AADB46235B8C](https://www.cambridge.org/core/journals/mathematical-structures-in-computer-science/article/abstract-machines-optimal-reduction-and-streams/60F590DA98EAC48CC6E7AADB46235B8C)

Sorry for the citation dump; I'm short on time, but I thought these would be
interesting to those who found this submission interesting.

References

[https://en.wikipedia.org/wiki/Dyck_language](https://en.wikipedia.org/wiki/Dyck_language)

[https://en.wikipedia.org/wiki/Bicyclic_semigroup](https://en.wikipedia.org/wiki/Bicyclic_semigroup)

~~~
raphlinus
The Dyck language I was already aware of; I linked it in the previous post in
the series. But the bicyclic semigroup was new to me. Yes, the bones of it are
very similar to my "stack monoid": if you erase the actual element values of
the stack contents and just count the lengths, it's clearly the same thing. No
doubt there's some theory that justifies adding the elements back in as well.
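
One way to add the values back in (a hedged sketch of mine; the blog post's
own formulation may differ in detail): an element is (pops, pushes), meaning
"pop this many items, then push this list of values", and composition
resolves the second element's pops against the first element's pushes.

    def combine(x, y):
        p1, s1 = x
        p2, s2 = y
        if p2 <= len(s1):
            # y's pops are absorbed by x's pushes
            return (p1, s1[:len(s1) - p2] + s2)
        # y pops through all of x's pushes and then some
        return (p1 + (p2 - len(s1)), s2)

    identity = (0, [])

    # push 'a', push 'b', then pop once and push 'c'
    assert combine((0, ['a', 'b']), (1, ['c'])) == (0, ['a', 'c'])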

I'll have to check out the Kmett talk - from the description it sounds very
interesting.

~~~
nilknarf
Although I didn't know it by that name, I have seen the bicyclic semigroup in
competitive programming many times before (usually in problems involving
intervals/brackets and segment trees).

For example, in this problem
[https://codeforces.com/contest/1285/problem/E](https://codeforces.com/contest/1285/problem/E),
one solution is to reparse and count the number of top-level parenthesis
groups after small modifications. For example, the original string might be
"(()())()". Then "(()())" is 1, "(())()" is 2, and "()()()" is 3.

This is solved with the following monoid to make incremental reparsing fast:

    from collections import namedtuple

    # Each node summarizes a substring in canonical form:
    #   )))) (...)(...)(...) ((((
    #   close = unmatched ')' at the start
    #   count = complete top-level (...) groups in the middle
    #   open  = unmatched '(' at the end
    Node = namedtuple("Node", ["close", "count", "open"])

    neutral = Node(0, 0, 0)

    def combine(x, y):
        if x.open == y.close == 0:
            ret = x.close, x.count + y.count, y.open
        elif x.open == y.close:
            # the brackets meeting in the middle form one new group
            ret = x.close, x.count + 1 + y.count, y.open
        elif x.open > y.close:
            # y is nested inside x's surplus opens
            ret = x.close, x.count, x.open - y.close + y.open
        else:  # x.open < y.close: x is nested under y's surplus closes
            ret = x.close + y.close - x.open, y.count, y.open
        return Node(*ret)

Reference solution:
[https://codeforces.com/contest/1285/submission/92169958](https://codeforces.com/contest/1285/submission/92169958)

Monoid-cached trees are extremely popular in competitive programming, but
never under that name. It's always referred to as a "segment tree", without
using any mathematical terminology more advanced than "associativity".

Due to its popularity, all kinds of crazy monoid states have already been
explored (though you need to squint a bit to see them due to the terminology
mismatch):
[https://codeforces.com/blog/entry/15890](https://codeforces.com/blog/entry/15890).
I've never seen a full stack used, though. You almost always want to compress
the state as much as possible, and the problems I've seen only needed stack
depth (aka the bicyclic semigroup), not contents. Requiring O(N) to merge
also severely limits its use cases.
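
For illustration, here's a minimal sketch (mine, not from any linked
solution) of such a monoid-cached tree: a segment tree over the Node monoid
above, padded to a power of two so the order of combines matches a
left-to-right fold even though the monoid isn't commutative.

    class SegmentTree:
        def __init__(self, items, combine, neutral):
            n = 1
            while n < len(items):
                n *= 2  # pad to a power of two
            self.n, self.combine = n, combine
            self.tree = [neutral] * n + list(items) + [neutral] * (n - len(items))
            for i in range(n - 1, 0, -1):
                self.tree[i] = combine(self.tree[2 * i], self.tree[2 * i + 1])

        def update(self, i, value):
            # O(log n) combines per point update
            i += self.n
            self.tree[i] = value
            while i > 1:
                i //= 2
                self.tree[i] = self.combine(self.tree[2 * i], self.tree[2 * i + 1])

        def root(self):
            # aggregate of the whole sequence
            return self.tree[1]

    # e.g. leaves are Node(0, 0, 1) for '(' and Node(1, 0, 0) for ')';
    # after each modification, update one leaf and read root().count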

Anyway, not sure how relevant this is to you since the use of monoids in data
structures and for parallelism is slightly different (binary merges all the
way down versus stopping at some chunk size).

------
nilknarf
> Automatically generating monoids such as the stack monoid from their
> corresponding simple sequential programs seems like an extremely promising
> area for research.

I was interested in this problem a while back and the papers I remembered
being useful were:

"Automatic Inversion Generates Divide-and-Conquer Parallel Programs":
[https://core.ac.uk/download/pdf/192663016.pdf](https://core.ac.uk/download/pdf/192663016.pdf)

which is based on "The Third Homomorphism Theorem":
[http://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/th...](http://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/thirdht.ps.gz)

The motivating example is whether you can generate a mergesort (where the
monoid operation merges two sorted lists) given an implementation of an
insertion sort (where the sequential program inserts elements one by one into
a sorted list).
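
A sketch of the two views the theorem connects (mine, not from the papers):
sort expressed as a leftwards fold with an insertion step, and the merge that
a derived divide-and-conquer version would use as its combine.

    import bisect
    from functools import reduce

    def insert_one(sorted_xs, x):
        # the sequential step of insertion sort
        bisect.insort(sorted_xs, x)
        return sorted_xs

    def merge(xs, ys):
        # the derived combine: associative, with [] as identity
        out, i, j = [], 0, 0
        while i < len(xs) and j < len(ys):
            if xs[i] <= ys[j]:
                out.append(xs[i]); i += 1
            else:
                out.append(ys[j]); j += 1
        return out + xs[i:] + ys[j:]

    # both routes agree on the same input, split arbitrarily
    data = [5, 1, 4, 2, 3]
    assert reduce(insert_one, data, []) == merge(sorted(data[:2]), sorted(data[2:]))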

I remember being disappointed by the answer after reading these papers but in
the citation graph there are a bunch of relatively recent program synthesis
papers. Maybe there's something better now.

~~~
raphlinus
Thanks for the reference, it's definitely relevant. I took a pass over the
paper somewhere between skimming and reading, and while it's definitely in
the ballpark of deriving parallel programs from sequential ones, it's also
pretty clear their source language can't express the stack monoid problem
because it lacks a "cons" operation; basically it seems to be aimed at fancy
reduction problems that consume a sequence and spit out a scalar. On a quick
read, I have no idea whether that's a fundamental limitation or something
that could be extended.

I think examples like generating mergesort from insertion sort are maybe
trying to be too clever. I mostly just want something that lets you slice up
the problem into chunks, process each chunk in parallel, then combine them
back in ways that don't kill performance on modern GPU hardware.

------
neel_k
In the conclusion, raph writes:

> I am even more convinced than before that efficient parsing is possible on
> GPU.

IMO, the place to start looking for GPU-friendly parsing algorithms is
Valiant's algorithm, which is asymptotically the fastest known general CFG
parsing algorithm and is implemented in terms of Boolean matrix
multiplication.

The intuition is that bottom-up parsing is essentially a transitive closure
computation, which in turn is much like solving systems of linear equations
over a Boolean semiring, and that can be sped up using Strassen's algorithm.
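
As a toy illustration (mine, not Valiant's actual algorithm): transitive
closure by repeated squaring of a 0/1 adjacency matrix, where the matrix
product is exactly the Boolean product that Strassen-style multiplication
accelerates.

    import numpy as np

    def transitive_closure(adj):
        # include the identity so squaring accumulates paths of any length
        n = len(adj)
        reach = ((np.asarray(adj) + np.eye(n, dtype=int)) > 0).astype(int)
        while True:
            # 0/1 product thresholded back to 0/1 == Boolean matrix product
            nxt = (reach @ reach > 0).astype(int)
            if np.array_equal(nxt, reach):
                return reach
            reach = nxt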

A bit of Googling suggests that people have looked at optimising Boolean
matrix multiplication on GPUs, but I have no idea what the state of the art
here is.

~~~
082349872349872
Didn't SIGFPE's blog cover this years ago?

Bonus clip:
[https://www.youtube.com/watch?v=j4YNPhllDXU](https://www.youtube.com/watch?v=j4YNPhllDXU)

------
raphlinus
A note: I just updated the post with a bunch of related work that people have
sent me since I published the first draft. It also includes a bit more
intuition behind the monoid.

------
wrnr
How do you get started with learning GPU programming? I mean that in a
pedagogical way: what is the best educational path forward? Build stupid
stuff in Shadertoy?

~~~
raphlinus
It's a good question, and it's been suggested I write a "GPU compute 101". I
did give a talk at Jane Street earlier this year called "A taste of GPU
compute", which has some relevant resources (a Google search on that term
will give you a number of links).

I think there are probably a number of on-ramps. One easy way to get started
is shadertoy, which if nothing else should give familiarity with GLSL syntax
and intuition for performance in the simplest case: each "thread" computes
something (a pixel value) independently of the others. You'll quickly run into
limitations though, as it can't really do any of the more advanced compute
stuff. I think as WebGPU becomes real, an analogous tool that unlocks compute
kernels (aka compute shaders) could be very powerful for pedagogy.

I think most people who do compute on GPU use CUDA, as it's the only really
practical toolchain for it. It has a large number of high-quality open source
libraries, which tend to be well documented and have good research papers
behind them. You can start by using these libraries, then dig deeper to see
how they're implemented.

As I've been going on about, I believe this space is ripe for major growth in
the next decade or so. As a rough guide, if you can make your algorithm fit
the GPU compute model nicely, you'll get about 10x the compute per dollar,
which is effectively the same as compute per watt. Why leave such performance
on the table? The answer is that programming GPUs is just too hard. In certain
areas, including machine learning, an investment of research has partially
fixed that problem. But in others there is fruit at medium height, ripe for
the picking. And in some other areas, you'll need a tall ladder, of which I
believe a solid understanding of monoids is but one rung.

~~~
wrnr
Thanks, it was your work that kindled my interest in programming the GPU. The
fact that most vector graphics are done by CPU libraries made me resize it's
still a very underdeveloped area. The hard thing about it is also what makes
it interesting, it's not just a different programming language but a different
architecture with other runtime characteristics. Sort of the closest you can
get to owning a quantum computer. Anyway I'll learn some math and cuda, and
buy the book if you find the time :)

------
chombier
About the stack/monoid structure, there was a nice talk by Conal Elliott
recently describing how it can be used to implement a compiler [0]. He also
mentions parallel computations at some point.

[0]
[https://www.youtube.com/watch?v=wvQbpS6wBa0](https://www.youtube.com/watch?v=wvQbpS6wBa0)

~~~
agumonkey
I may be extrapolating too much, but I think Steele's team working on
Fortress was also doing monoid-like thinking to parallelize heavily. He gave
a talk on how to delinearize problems into partially solved subproblems to be
reconciled later.

~~~
rudedogg
Is this the talk?:
[https://youtu.be/EZD3Scuv02g?t=2192](https://youtu.be/EZD3Scuv02g?t=2192)

~~~
agumonkey
hmm related but not the one I had in mind

[https://www.infoq.com/presentations/Thinking-Parallel-Programming/](https://www.infoq.com/presentations/Thinking-Parallel-Programming/)

ps: you can see they were using mathematical structures (rings, monoids) in
your video anyway

~~~
rudedogg
Thanks, the talk you linked goes into more detail.

------
baybal2
What is a monoid? I googled to no avail. Lots of articles, but none tell what
the heck it actually is.

~~~
jdmichal
So a bunch of people are giving you good _definitions_. Here's an article
that does FizzBuzz with monoids. I think it's the first thing I saw that
actually cemented them in my mind, probably because it takes a
problem-to-solution teaching approach. (Unfortunately it looks like the
original page is dead.)

[https://web.archive.org/web/20200325013049/http://dave.fayr.am/posts/2012-10-4-finding-fizzbuzz.html](https://web.archive.org/web/20200325013049/http://dave.fayr.am/posts/2012-10-4-finding-fizzbuzz.html)
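
For a taste of the idea (my own toy version, not the article's code): the
trick is that strings form a monoid under concatenation, and the empty string
being the identity makes the "neither" case fall out naturally.

    def fizzbuzz(n):
        # "Fizz" and "Buzz" combine in the string monoid; "" is the identity
        label = ("Fizz" if n % 3 == 0 else "") + ("Buzz" if n % 5 == 0 else "")
        return label or str(n)

    assert [fizzbuzz(n) for n in (3, 5, 15, 7)] == ["Fizz", "Buzz", "FizzBuzz", "7"]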

~~~
hyperman1
I just noticed a common pattern in a lot of these answers:

* First, give a definition

* Then some basic examples to clarify the concept

* Then some advanced usage to demonstrate the advantages.

I tend to do it myself, maybe because a math concept has a clear definition
and math courses do it too. However, it scares away a lot of people, including
myself when I'm on the receiving end.

------
boulos
Raph, I think some of the order-independent transparency folks (Bavoil, Wyman,
Lefohn, Salvi, etc.) have had to do this bounded k-stack plus overflow thing.
Might be a good "hmm, how did they do it" for comparison.

~~~
raphlinus
Good call, that rings a bell and I'll look into it. A bounded k-stack with
overflow seems like a really good fit for the capabilities provided by GPUs.
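
A minimal sketch of the idea (mine; a real GPU version would keep the hot
entries in registers or workgroup-shared memory and spill to a global
buffer):

    class KStack:
        # top k entries stay in fast "local" storage; the rest spill
        def __init__(self, k):
            self.k, self.local, self.spill = k, [], []

        def push(self, v):
            if len(self.local) == self.k:
                self.spill.append(self.local.pop(0))  # evict the bottom entry
            self.local.append(v)

        def pop(self):
            if not self.local and self.spill:
                self.local.append(self.spill.pop())   # refill from "global"
            return self.local.pop()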

~~~
boulos
Hmm. In the end, maybe transparency is a bad example, because people
immediately jumped to approximating it instead. Wyman's review paper is a
good read regardless:

[http://cwyman.org/papers/hpg16_oitContinuum.pdf](http://cwyman.org/papers/hpg16_oitContinuum.pdf)

The GPU / accelerator BVH traversals are perhaps more applicable (e.g.,
[https://www.embree.org/papers/2019-HPG-ShortStack.pdf](https://www.embree.org/papers/2019-HPG-ShortStack.pdf)),
but they also have a different "I can restart my computation" property that,
say, JSON parsing wouldn't have (though maybe if you're willing to do a lot
of restarts once you parse the inner portions and push them onto a heap...).

Anyway, cool problem!

~~~
raphlinus
The bell it rang for me was [1], which has come up in discussions with Patrick
Walton about maintaining a sort order for coarse rasterization in 2D vector
rendering. Taking another look at that, it seems like they propose _both_
bounded k-stacks and linked lists as solutions for recording all the fragments
to be combined for transparency blending, but not the hybrid of the two. I
wouldn't be surprised if that's been considered.

The problem is slightly different, though, because the major focus for
transparency is concurrent writing from potentially many different shaders,
while in the stack case each thread can build its data structure without any
concurrency or need for atomics; once an aggregate is built, it is "published"
(see my prefix sum blog for details on that) and treated as immutable.

These kinds of concurrent approaches do become important for later stages in
the JSON parsing process, like building hashmaps for the keys in a
dictionary. I'm pretty sure the whole problem is tractable and have thoughts
on how to solve it, but I deliberately kept the scope of this blog post
limited, as I figured it was challenging enough. But it's very cool to reach
people who appreciate the ideas, and if you find anything else that's
relevant, I'd love to hear from you.

[1]: [https://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf](https://on-demand.gputechconf.com/gtc/2014/presentations/S4385-order-independent-transparency-opengl.pdf)

------
judofyr
My first thought when seeing this stack problem was that it might be possible
to use the techniques presented in "Internally Deterministic Parallel
Algorithms Can Be Fast" [1]. The idea is that the algorithm repeatedly runs
on a prefix of the remaining input in two steps: in the first step it figures
out which items in the prefix can be processed in parallel (e.g. those that
don't depend on any earlier unprocessed input), and in the second step it
actually processes them. Both steps are themselves done in parallel. There's
also a slide deck [2] which explains it quite well.
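
A sequential toy sketch of that pattern (mine, not the paper's code; it
assumes depends_on(i) only ever returns indices smaller than i, so each round
makes progress):

    def prefix_rounds(items, depends_on, prefix_size):
        # returns batches of items; each batch could run in parallel
        done, rounds = set(), []
        pending = list(range(len(items)))
        while pending:
            prefix = pending[:prefix_size]
            # step 1: find prefix items whose dependencies are all finished
            ready = [i for i in prefix
                     if all(d in done for d in depends_on(i))]
            # step 2: "process" the ready items as one parallel batch
            rounds.append([items[i] for i in ready])
            done.update(ready)
            pending = [i for i in pending if i not in done]
        return rounds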

I'm not quite sure how it would work on this problem, but it might be
interesting to look into.

[1]:
[https://www.cs.cmu.edu/~guyb/papers/BFGS12.pdf](https://www.cs.cmu.edu/~guyb/papers/BFGS12.pdf)

[2]:
[http://www.cs.cmu.edu/~blelloch/papers/spaapodc17.pdf](http://www.cs.cmu.edu/~blelloch/papers/spaapodc17.pdf)

------
andrewflnr
The elements of this monoid kind of look like type signatures for Forth words.
I was already playing with the idea of a sort of concurrent forth-like where
each word sort of gets its own stack context. I'm really interested in neat
models that provide partial evaluation, so I will be giving this some thought.

------
acoye
I had this crazy idea a while back of using ray tracing APIs to piggyback a
database: testing for a row could be done by shooting parallel rays at a
structure mapped to the data.

------
agumonkey
best thread of the year

