

Compiling To Javascript In Continuation Passing Style, Can We Optimize? - jlongster
http://jlongster.com/2012/05/11/cps-optimizations.html

======
cjfrisz
I've spent a number of months working on using CPS and trampolining to add
proper tail calls to Clojure. I have a video presenting my work here:
<http://www.chrisfrisz.com/blog/?p=220> . You can also take a look at the code
here: <https://github.com/cjfrisz/clojure-tco> .

The punchline is that there are several things you can do to improve your
performance: first, a number of more efficient CPS algorithms have been
developed over the last 10+ years. I currently use one presented by Danvy in
the 2001 paper "A First-Order One-Pass CPS algorithm" linked here:
www.brics.dk/RS/01/49/BRICS-RS-01-49.pdf . Its major focus is to distinguish
between "serious" and "trivial" expressions, because the serious ones are the
only ones that actually require a new continuation to be generated. This saves
a lot of administrative redexes that can kill your performance.
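To make the serious/trivial distinction concrete, here is a hypothetical JavaScript sketch (my own illustration, not code from Danvy's paper or from clojure-tco): naive CPS wraps even a literal in an immediately-applied continuation, while the one-pass approach emits the trivial subexpression inline.

```javascript
// CPS of the expression (+ 1 (f x)), two ways. `fCps` stands in for some
// already-CPS-converted user function; all names here are illustrative.

// Naive CPS: every subexpression gets a continuation, so the literal 1
// flows through an immediately-applied lambda -- an administrative redex.
function naiveCps(fCps, x, k) {
  return (function (v1) {                 // redex: applied at once to 1
    return fCps(x, function (v2) { return k(v1 + v2); });
  })(1);
}

// Danvy-style one-pass CPS: 1 is "trivial" (evaluating it can never need a
// new continuation), only the call (f x) is "serious", so the literal is
// used in place and no extra closure is allocated.
function onePassCps(fCps, x, k) {
  return fCps(x, function (v2) { return k(1 + v2); });
}
```

Both versions compute the same value; the one-pass output just has one fewer function allocation and call per trivial subexpression.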

Some other CPS-related avenues you may want to look into include "The Essence
of Compiling with Continuations" (co-authored by one of my heroes, Amr Sabry)
which formalizes A-normal form, an alternative to compiling with continuations
that was ground-breaking in pointing out the number of unnecessary redexes
introduced by traditional CPS algorithms. There's also a more recent paper by
Andrew Kennedy at MS Research called "Compiling with Continuations, Continued"
which I admittedly have not gotten a chance to dig into yet. Its stated goal
is to revive some of the ideas of the original "Compiling with Continuations"
by Andrew Appel.

Of course with the method you present, continuation application isn't your
only performance hit. If you're using a trampoline, you're undoubtedly
thunkifying the code to delimit the bounces on the trampoline. I also need to
do some work in this area, but I'm familiar with at least one paper on doing
minimal thunkification from 1993 and linked here:
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.158....](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.158.7919)
.
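For readers unfamiliar with the thunkification being discussed, here is a minimal hedged sketch in JavaScript (illustrative names, not clojure-tco's or Outlet's actual output): each tail call is delayed in a zero-argument thunk, and a driver loop bounces until a non-function value comes back, so mutual recursion runs in constant stack space.

```javascript
// Driver: keep invoking thunks until a real (non-function) value appears.
function trampoline(thunk) {
  let result = thunk;
  while (typeof result === 'function') {
    result = result();        // one bounce per thunk
  }
  return result;
}

// even/odd mutual recursion, thunkified: each tail call is wrapped in a
// zero-argument function instead of being made directly, so each bounce
// returns to the trampoline rather than growing the JS stack.
function isEven(n) {
  return n === 0 ? true : () => isOdd(n - 1);
}
function isOdd(n) {
  return n === 0 ? false : () => isEven(n - 1);
}
```

The cost cjfrisz mentions is visible here: every bounce allocates a fresh arrow-function thunk, which is exactly what "minimal thunkification" tries to avoid.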

I'm a big fan of both CPS+trampolining and the work like ClojureScript that's
bringing efficient Lisp programming to the JavaScript arena. I'm hoping to get
a chance to look over the work that you've done with Outlet and I'd love to
chat with you about it. It sounds like we share quite similar interests.

~~~
jlongster
I already watched your video! It was one of the things that inspired me to try
this. Some of it was over my head (not extremely involved in the academic
world), but I'm at a place now where I can dig into clojure-tco and see what
you've done.

Thanks for the papers, I'll look at them. I'm worried about spending a lot of
time on this, only to realize that the performance hit is simply too large. I
want this for debugging, and what I really want to try is to write a web-based
byte code VM and compile Outlet to it. But the same concern exists there: can
I get it running fast enough to be usable?

I'm leaning more towards the CPS route though. First of all, there's tons of
research going into it. Secondly, I lose all the benefits of JIT-ed Javascript
if I run it in a VM. There's actually a much higher chance that I'll be able
to get CPS-ed code running fast than compiled byte code.

I admit that my current implementation is naive, and yes, the trampolines
require even more thunks.

I'd love to talk, I'm very interested in this as well. How far has clojure-tco
come? What's the performance hit, and how do you return code to the
trampoline? I will look at your code soon, and be lurking around #clojure as
jlongster.

~~~
cjfrisz
I've been finishing up grad school for the past couple of weeks, so Clojure
TCO hasn't seen updates for a while. One of the things on my to-do list for it
is to do more extensive benchmarking, but I've thrown a lot of ad-hoc tests at
it, and I'm getting performance remarkably close to that of standard Clojure code.
Especially with your interest, I want to nail down some numbers to have solid
performance comparisons.

I noticed that somebody commented about allocation being a big source of
slowdown. Someone suggested to me that if you have low-level enough access to your
implementation, you can reuse the thunk after each invocation. It took me a
while to get my head around it, and I'm not even sure it can be done in
Clojure TCO. If you can change the contents of an underlying closure data
structure, then you only have to allocate one thunk. I suspect that you're
taking advantage of JavaScript's closures, so you may not have any more luck
with it than me.
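As a rough illustration of the reuse idea (my own sketch, with made-up names; as noted above, this may not map onto closure-based implementations at all): with low-level enough control you can preallocate one mutable thunk record and overwrite its fields on every bounce, instead of allocating a fresh closure each time.

```javascript
// One preallocated, mutable thunk record. Safe only because a trampoline
// runs exactly one bounce at a time on a single thread.
const thunk = { fn: null, args: null, done: false, value: null };

function bounce(fn, args) {       // reuse the same object on every bounce
  thunk.fn = fn;
  thunk.args = args;
  thunk.done = false;
  return thunk;
}

function finish(value) {          // mark the computation complete
  thunk.done = true;
  thunk.value = value;
  return thunk;
}

function run(fn, args) {
  let t = fn(...args);
  while (!t.done) t = t.fn(...t.args);
  return t.value;
}

// countdown written against the reusable thunk: zero allocation per bounce
function countdown(n) {
  return n === 0 ? finish('done') : bounce(countdown, [n - 1]);
}
```

The trade-off is that the captured state must be passed explicitly in `args`, since there is no per-bounce closure left to capture it.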

If you're interested, one of my next goals now that I'm done with school is to
start writing about the algorithms that Clojure TCO uses. I want to write
something fairly accessible that describes the Danvy CPS algorithm.

~~~
jlongster
"... I've thrown a lot of ad-hoc tests at it, and I'm getting remarkably close
performance to standard Clojure code"

That's amazing!

"Someone suggested to me that if you have low-level enough access to your
implementation, you can reuse the thunk after each invocation ..."

Yep, I thought of that when I was looking at the generated JS code. I think
I'm out of luck there as well. However, as I mentioned at the end of my article, I
think you should be able to hoist all the functions into top-level forms and
pass around an environment. This way, functions are statically allocated once,
and the performance hit of passing around environments can't be nearly as
great as function allocation.
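A hedged sketch of the hoisting idea (hypothetical names, not what Outlet emits today): compare a continuation allocated per call against a single top-level continuation function that receives its free variables through a plain environment record.

```javascript
// Stand-ins for a CPS-converted g; illustrative only.
function gCps(x, k) { return k(x * 2); }
function gCpsEnv(x, cont, env) { return cont(env, x * 2); }

// Per-call closure allocation: a fresh continuation function every call.
function fooClosures(x, a, k) {
  return gCps(x, function (y) { return k(y + a); });
}

// Hoisted version: the continuation body `contAdd` is allocated exactly once
// at the top level; only a small plain environment object is built per call.
function contAdd(env, y) {
  return env.k(y + env.a);
}
function fooHoisted(x, a, k) {
  return gCpsEnv(x, contAdd, { a: a, k: k });
}
```

The environment objects still get allocated, but a plain record with a fixed shape should be cheaper for the JIT than a fresh closure per continuation.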

------
jules
Why do you need to transform to CPS? The answer to your question will depend
very much on your answer to that question. In the extreme if you are not doing
anything with the CPS form, then probably the best optimization of the CPSed
form will completely undo the CPS, transforming it back into direct style. Of
course if you are actually making use of the CPS form, that's not possible
anymore.

Depending on why you want CPS, you can apply the following strategy:

Keep all code in direct style. For each function, designate a special return
token CaptureCont that indicates that the continuation should be captured. The
source of this token will be callcc. After each function call you check if it
returned a token, and if so, extend it with the continuation for the rest of
the computation. Then on the main() function of your program, pass the
continuation to the appropriate place. This strategy avoids most overhead in
the common case: one instanceof check plus a branch per function call.

For example, take this function:

    
    
        def foo(x,a):
          y = g(x)
          return y+a
    

Transform to:

    
    
        class CaptureCont:
          member target # the function that we will pass the continuation to once we finish constructing it
          member partialCont # the part of the continuation captured so far
          def extend(f):
            newCC = new CaptureCont()
            newCC.target = target
            newCC.partialCont = lambda v: f(partialCont(v))
            return newCC
          def invoke():
            return target(partialCont)

        def foo(x, a):
          y = g(x)
          if y is CaptureCont:
            # now y is the part of the continuation captured so far
            # our job is to extend it with our stack frame
            # and then return it so that the callee can build it up further
            return y.extend(lambda y2: y2+a)
          else:
            return y+a
    

Instead of just invoking main(), do this:

    
    
        loop:
          v = main()
          if v is CaptureCont: v.invoke()
          else: exit(v) # well I guess this is not necessary in JS
    

Eventually the CaptureCont will bubble up to main, and at that point main will
invoke it so that the continuation is passed to the right target.

Callcc:

    
    
        def callcc(f):
          cc = new CaptureCont()
          cc.target = f
          cc.partialCont = lambda x: x
          return cc
    

There are various optimizations that can be applied (like eliding the capture
code for functions that you statically know don't call callcc and thus will
never return CaptureCont), and this approach also (trivially) works for the
more general delimited continuations. As described capturing the continuation
is O(n) in the size of the stack. You can make it amortized O(1) with some
more trickery by lazily rebuilding the stack, but the current approach is
already blazingly fast in the case of no continuation capture. I'm really
tired now, so there are probably lots of errors in the above description, so
tread with caution. I don't know if this approach is new or not, but it
probably isn't given that it's a fairly obvious approach to capture the stack.
Does anybody know what it's called? Then the OP can find a better description.
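For what it's worth, here is one way the sketch above could look as runnable JavaScript (my own rendering; the names follow the pseudocode but the details are assumptions, so treat it with the same caution):

```javascript
// A CaptureCont bubbling up the stack carries the continuation built so far.
class CaptureCont {
  constructor(target, partialCont) {
    this.target = target;            // gets the finished continuation
    this.partialCont = partialCont;  // continuation captured so far
  }
  extend(f) {                        // add one stack frame to the capture
    const prev = this.partialCont;
    return new CaptureCont(this.target, v => f(prev(v)));
  }
  invoke() {
    return this.target(this.partialCont);
  }
}

function callcc(f) {
  return new CaptureCont(f, x => x); // start with the identity continuation
}

// Direct-style function with the capture check after its one call site.
function foo(g, x, a) {
  const y = g(x);
  if (y instanceof CaptureCont) return y.extend(y2 => y2 + a);
  return y + a;
}

// Driver loop standing in for main().
function run(main) {
  let v = main();
  while (v instanceof CaptureCont) v = v.invoke();
  return v;
}
```

In the common case (`g` never calls callcc) the only overhead is the instanceof check and branch after each call.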

~~~
jlongster
That's a smart trick, and it's possible I might be able to do something like
that.

The reason I want to do CPS is for debugging in the browser. I want to be able
to control the stack, so when a breakpoint hits I can pause everything and
step through the code. The browser is non-blocking so I need explicit control
of the stack.

Your method is neat. My language doesn't support continuations, so I don't
need them until I hit the debugging mode. You've made me curious if I can
compile out both CPS-ed and non-CPS-ed versions of the code and somehow switch
between them.

~~~
gliese1337
I did a very similar thing with a python-implemented LISP interpreter, mainly
so that I could implement a trampoline so I could do tail call optimization.
It required CPSing the _interpreter_ (and built-in functions), but not the
interpreted program- this works because the continuation of the interpreted
program is contained in the continuation of the interpreter that runs it. The
interpreted "stack" ends up as a chain of Python closures.

You might be able to get away with compiling a runtime environment slightly
more complex than your simple trampoline loop that does the same thing,
building a chain of JavaScript closures to represent continuations without
actually CPSing the rest of the program. Continuation calls would then show up
on the JavaScript stack and allow you to step through them.
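A hedged miniature of that idea in JavaScript (hypothetical; the actual implementation being described is in Python): the evaluator itself is written in CPS, the interpreted program stays plain data, and the interpreted "stack" is just the chain of closures threaded through `k`.

```javascript
// Tiny CPS evaluator: numbers, variable lookup, and (+ a b) forms.
// The interpreted program is plain data; only the interpreter is CPSed,
// so pending interpreted work lives in the chain of JS closures in `k`.
function evaluate(expr, env, k) {
  if (typeof expr === 'number') return k(expr);
  if (typeof expr === 'string') return k(env[expr]);
  if (expr[0] === '+') {
    // each pending operation extends the closure chain
    return evaluate(expr[1], env, function (a) {
      return evaluate(expr[2], env, function (b) {
        return k(a + b);
      });
    });
  }
  throw new Error('unknown form: ' + expr);
}
```

A real version would add a trampoline around `evaluate` so deeply nested interpreted programs don't overflow the host stack; this sketch omits that for brevity.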

~~~
jlongster
Wouldn't that be even slower though? Not only is it CPS code, but it's a program
being interpreted by CPS code. I think I see what you're getting at though.
Sounds neat; I definitely have a lot of new ideas for implementing CPS in a
performant way now.

Outlet is a compiler which compiles straight to js, no interpretation.

~~~
gliese1337
Not sure; I haven't got anything to benchmark my implementation against. My
suspicion is that it would be faster because you get chains of interpreted-
continuations-implemented-as-interpreting-closures that call each other
normally, rather than trampolining absolutely everything. But I would not be
surprised if I turn out to be completely wrong about that.

> Outlet is a compiler which compiles straight to js, no interpretation.

Right; hence, you would have to include the implementation of the CPSed
runtime environment in the compiled output, just as you have to include the extra
trampoline function when you do the current CPS transform.

------
larsberg
1) Using one of the more efficient CPS transformations will help remove many
stupid constructs that are harder to statically find later.

2) In our CPS-based intermediate representation, we rely heavily on control
flow analysis (CFA) to enable many classic compiler optimizations (inlining,
untupling, unreachable code elimination, cases where only one member of a
datatype is constructed/destructed, etc.).

CPS is a great transformation and makes for a wonderful IR, but you have to
pay a decent static analysis cost before you will be able to perform many
classic optimizations and get great performance. That said, we've done pretty
well (<http://manticore.cs.uchicago.edu> ), but it's taken years of
development. I'd also like to second cjfrisz's pointer to Appel's green book.
It's a fabulous introduction to compiling in a CPS-style IR, though you'll
find you still have a fairly large performance gap if you don't implement the
CFA-based versions of them.

~~~
MaysonL
Re the manticore home page, a minor nit: the antonym of fine-grained is
_coarse_ -grained, not course-grained.

~~~
larsberg
Thanks! That text has been there for at least 6 years and we've never noticed.

~~~
MaysonL
It's also in the wiki, I believe.

------
dgreensp
I think your intuition is exactly correct; it's the function allocations that
are killing you.

Fascinating stuff.

------
asynchrony
If you haven't read it already, this paper is quite interesting with regards
to supporting TCO in javascript (using trampolines as a last-resort).
<http://www-sop.inria.fr/indes/scheme2js/files/tfp2007.pdf>

------
amasad
I went down this road before.

I had an idea but never got around to implementing it: use the new ES Harmony
generators to pause execution in the non-CPSed version of the code. I'm happy
to elaborate further on that if you're interested.

~~~
jlongster
I thought of that at one point, but the lack of browser support kills it for
me unfortunately. :/

------
pwpwp
Could you try your code in some other JS-based Schemes for comparison, e.g.
<http://www.biwascheme.org/> ?

~~~
jlongster
I don't see any reference to continuations in there.

~~~
pwpwp

       biwascheme> (call-with-current-continuation (lambda (k) (k 12)))
       => 12
    

It's based on Dybvig's "Three Implementation Models for Scheme"
<http://www.cs.unm.edu/~williams/cs491/three-imp.pdf>

~~~
jlongster
Oh, cool. It appears to be an interpreter, which is interesting, and worth
profiling to see how it performs. I'll have to do that later.

