
Fractal: An Execution Model for Fine-Grain Nested Speculative Parallelism [pdf] - deepnotderp
http://people.csail.mit.edu/sanchez/papers/2017.fractal.isca.pdf
======
arjo1
I tried to understand what is going on in this paper. Am I right to say that
it seems like they are looking at parallelizing subtasks within each domain
and at each time taking advantage of unordered tasks if possible?

~~~
suvinay
Correct. One of the goals of this work was to expose and exploit the abundant
parallelism within large _atomic_ tasks. However, the tasks within a domain
can be ordered or unordered. Our system will exploit any kind of parallelism
that is available.

------
suvinay
Hi all. This is Suvinay Subramanian from MIT. I am one of the authors on the
Fractal paper. I wanted to provide some context for this work, and briefly
summarize our paper. I hope this will address some of the comments on this
thread. And I am happy to answer any further queries you may have.

You may also be interested in the slides of this work presented at the
International Symposium on Computer Architecture (ISCA) last week:
[http://bit.ly/fractal-slides](http://bit.ly/fractal-slides)

What motivated this work? While multi-cores are pervasive today, programming
these systems and extracting performance on a range of real-world applications
remains challenging. Systems that support speculative parallelization, such as
Transactional Memory (supported in hardware in commercial processors today,
such as Intel's Haswell, IBM's POWER8, etc.), help simplify parallel
programming by allowing programmers to designate blocks of code as atomic,
instead of writing complex synchronization with locks, semaphores, etc.

However, their performance on real world applications remains poor. This is
because real world applications are complex, and often have large atomic tasks
either due to necessity or convenience. For example, in database workloads,
transactions have to remain atomic (i.e. all actions within a transaction have
to "appear as a single unit"). But each transaction can be long (~few million
cycles) -- such large tasks are prone to aborts, and are expensive to track
(in software or hardware).

Now, there is often abundant parallelism inside such large tasks (as has been
noted in prior work). But the challenge is how to extract this parallelism
while also keeping the entire unit atomic. Prior approaches, including nested
Transactional Memory, and Thread-Level-Speculation (TLS) have been
unsuccessful in extracting this parallelism -- section 7 of our paper explains
why (or see slides).

Fractal provides a new execution model that addresses the limitations of prior
approaches, and proposes a hardware architecture that can successfully scale
several challenging workloads -- some of which have eluded successful
parallelization for decades. Importantly, Fractal allows us to operate at the
granularity of "small" tasks that are easy to track. Further Fractal enables
programmers to seamlessly compose speculative parallel algorithms.

What are the key ideas of Fractal? There are two key insights:

1. Decouple atomicity from parallelism -- We introduce the notion of a domain,
where all tasks belonging to a domain appear to execute as a single atomic
unit. By doing this, we can execute the "subtasks" in parallel, while ensuring
atomicity of the "collection of subtasks" via other mechanisms.

2. Ordering to enforce atomicity -- Unlike prior techniques that "merge"
subtasks to enforce atomicity, we leverage ordering to enforce atomicity.

Tell me more about the hardware, aka the Swarm architecture. Conventional
wisdom dictates that ordering limits parallelism, and ordered parallelism is
indeed hard to extract -- there have been several attempts at taming this
beast, including Thread-Level Speculation (TLS), as alluded to in one of the
comments.

In our prior work, Swarm, we demonstrated that it is possible to extract
parallelism even in the presence of ordering constraints, with a more
expressive execution model, and some nifty hardware tricks.

At a high level, the Swarm hardware executes tasks speculatively, and elides
ordering constraints. Only true data dependences force an "ordered" execution.
On the hardware side, Swarm has large hardware task queues to manage tasks,
and a high-throughput commit protocol and speculation mechanisms that allow it
to scale.

You can read more on Swarm at:
[http://bit.ly/swarmarch](http://bit.ly/swarmarch)

Fractal relies on the Swarm hardware to enforce ordering guarantees while
extracting parallelism. The key challenges were: how to assign the subtasks an
ordering that also maintains the atomicity of "domains" of subtasks; how to
assign this ordering dynamically; and how to support "unbounded nesting",
e.g. if your subtask makes a library call, that call itself could be
parallelized (but must still remain atomic), and so on. Supporting unbounded
nesting is essential for software composition of algorithms.

Is Fractal / Swarm just an accelerator? We believe that Fractal and Swarm are
a more general solution to some of the hardest parallel computing problems. We
have shown that the ideas presented in these works are applicable in a wide
range of domains ranging from databases, to discrete-event simulation, to
machine learning, and graph analytics.

When will we see this in hardware? Currently, we evaluate our ideas in
simulation. But our group is working on prototyping this on an FPGA in the
coming year. So stay tuned!

~~~
deepnotderp
Awesome! I have a couple of questions if you don't mind:

1) Does this require the programmer to rewrite software or make annotations?
Even if that's the optimal way, can a compiler do this automatically, at least
to some extent?

2) How general of a workload can this accelerate? You mention speculative
multithreading and HTM in both papers, both of which are fairly general
purpose techniques.

~~~
suvinay
1. Currently the programmer needs to delineate code into tasks. A couple of
points to note:

-- Often the programmer can start with the serial code and determine how to
delineate tasks. In our experience this is often easier than writing more
complex parallelized versions of the same algorithm. Compared to the serial
code, our programs are just a few percent longer.

-- We have implemented high-level interfaces along the lines of OpenMP or
OpenTM to make it easier for the programmer to break straight-line serial code
into fine-grain tasks. For example, if you have a for loop to parallelize, you
can use the fractal::forall(...) construct. Table 1 of the Fractal paper lists
some of these high-level interface functions.

We are currently working on a compiler that can take serial code and
automatically annotate tasks and domains. Keep watching our webpage for
updates on that.

2. Our execution model subsumes both Transactional Memory (TM) and Thread-
Level Speculation (TLS). Any program that can be expressed in those paradigms
can be expressed in Fractal. And as we show in the paper, Fractal can express
parallelism that was not possible in TLS or TM. So we believe it is general
purpose. Further, as indicated above, we have seen these ideas to be
applicable to applications in a wide range of domains.

That being said, we are always on the lookout for applications that reveal
deficiencies in our execution model, to help improve it :)

~~~
deepnotderp
1. Yes, I've seen that, and I think that an OpenMP-esque annotation system
specified by the programmer would be optimal, but I don't see any reason that
a "sufficiently smart compiler" [ ;) ] can't produce an approximation. Given
the SpMT-like "thread" (task) squashing, it shouldn't cause any semantic
issues.

2. Absolutely, SpMT has been proposed as a general-purpose technique. That's
why I found it odd that the benchmarks weren't on a general-purpose suite like
SPECint. Do you have any benchmarks on SPECint, or any plans to run them?

~~~
suvinay
2: We are currently working on a compiler to automatically delineate tasks,
assign domains, etc., given sequential code. As part of that effort, we are
looking at SPEC workloads, yes.

------
exabrial
Is this a physical processor or scheduling algorithm?

~~~
suvinay
This is a new execution model + hardware architecture. Currently we have
evaluated it in simulation only. But stay tuned for hardware.

~~~
nkoren
Complete n00b question -- this is very, very far outside my domain -- but by
using much more of a chip in parallel, does this present any challenging
thermal implications?

~~~
suvinay
This is a concern orthogonal to our work. There are already commercial chips
with a large number of cores / processing elements, e.g. NVIDIA's Volta has
~5120 "cores" and Intel's Xeon Phi has up to 72 cores, and there is active
research on handling thermal concerns in such large chips. Maybe someone who
is more knowledgeable can shed light on those approaches. I do not think
Fractal introduces any new thermal concerns.

------
znpy
Did I get it right, does this CPU use the x86-64 ISA?

(see page 8, Table 2: Configuration of the 256-core system)

~~~
nn3
It's only simulated at this point.

There are fairly good cycle-accurate open-source x86 simulators, such as
PTLsim or MARSSx86:

[http://marss86.org/~marss86/index.php/Home](http://marss86.org/~marss86/index.php/Home)
[http://caps.cs.binghamton.edu/tools.html](http://caps.cs.binghamton.edu/tools.html)

------
kom107
Hi Suvinay! Do you all have a website that keeps interested people updated
about the upcoming Swarm hardware? I would like to follow this work. Thanks!

~~~
suvinay
Thank you for your interest. We are planning to create a web page for this
soon (likely [http://swarm.mit.edu](http://swarm.mit.edu) -- it's not live
yet). Till then, you should be able to find our papers and some upcoming ones
at my advisor, Daniel Sanchez's web page:
[http://people.csail.mit.edu/sanchez/](http://people.csail.mit.edu/sanchez/).

------
brian_herman
Moore's law lives on?

~~~
microcolonel
Yeah, seems like it's just getting more discrete and less continuous.

That said, this is not an 88x speedup for general purpose computing, just for
some specific workload of parallel compute.

~~~
deepnotderp
If I read this correctly, this is effectively a better way to do thread level
speculation (aka Speculative Multithreading), so in theory this could help in
general purpose computing as well, just not 88x ;)

~~~
suvinay
Your intuition on this improving Thread Level Speculation (TLS) is correct. I
do want to clarify that this is not just a better implementation of TLS though
-- our execution model is more expressive and avoids some of the pitfalls of
TLS (see the Swarm paper [http://bit.ly/swarmarch](http://bit.ly/swarmarch) to
learn more). Yes, you could use our implementation with a TLS-like execution
model and improve on the TLS performance. But you are likely to reap more
rewards by using the Fractal execution model + implementation.

