You may also be interested in the slides of this work presented at the International Symposium on Computer Architecture (ISCA) last week: http://bit.ly/fractal-slides
What motivated this work?
While multicores are pervasive today, programming these systems and extracting performance from a range of real-world applications remains challenging. Systems that support speculative parallelization, such as Transactional Memory (supported in hardware in commercial processors today, e.g., Intel's Haswell and IBM's POWER8), help simplify parallel programming by allowing programmers to designate blocks of code as atomic, instead of using complex synchronization with locks, semaphores, and so on.
However, their performance on real-world applications remains poor. This is because real-world applications are complex, and often have large atomic tasks, either out of necessity or convenience. For example, in database workloads, transactions must remain atomic (i.e., all actions within a transaction must appear to execute as a single unit), but each transaction can be long (a few million cycles) -- such large tasks are prone to aborts, and are expensive to track (in software or hardware).
Now, there is often abundant parallelism inside such large tasks (as prior work has noted). The challenge is extracting this parallelism while also keeping the entire unit atomic. Prior approaches, including nested Transactional Memory and Thread-Level Speculation (TLS), have been unsuccessful at extracting this parallelism -- Section 7 of our paper explains why (or see the slides).
Fractal provides a new execution model that addresses the limitations of prior approaches, and proposes a hardware architecture that successfully scales several challenging workloads -- some of which have eluded successful parallelization for decades. Importantly, Fractal allows us to operate at the granularity of "small" tasks that are cheap to track. Further, Fractal enables programmers to seamlessly compose speculative parallel algorithms.
What are the key ideas of Fractal?
There are two key insights:
1. Decouple atomicity from parallelism -- We introduce the notion of a domain, where all tasks belonging to a domain appear to execute as a single atomic unit. By doing this, we can execute the "subtasks" in parallel, while ensuring atomicity of the "collection of subtasks" via other mechanisms.
2. Ordering to enforce atomicity -- Unlike prior techniques that "merge" subtasks to enforce atomicity, we leverage ordering to enforce atomicity.
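To make idea 2 concrete, here is a hedged toy sketch (the task names and tuple timestamps are invented for illustration; this is not the Fractal API): give each subtask a timestamp that extends its domain's timestamp, and commit in lexicographic timestamp order. Subtasks of the same domain then commit contiguously, so each domain appears atomic even though its subtasks ran in parallel.

```python
# Hypothetical sketch: each task's timestamp is a tuple; a subtask in a
# domain extends its parent domain's timestamp. Lexicographic (tuple)
# order then keeps every domain's subtasks contiguous in the commit
# order, so the domain appears to execute as a single atomic unit.

log = []

def commit(timestamp, task_name):
    log.append((timestamp, task_name))

# Two top-level transactions (domains) with timestamps (1,) and (2,).
# Their subtasks finish in some arbitrary parallel order...
commit((2, 0), "txnB.sub0")
commit((1, 1), "txnA.sub1")
commit((1, 0), "txnA.sub0")
commit((2, 1), "txnB.sub1")

# ...but the commit order is the lexicographic timestamp order.
log.sort()
print([name for _, name in log])
# -> ['txnA.sub0', 'txnA.sub1', 'txnB.sub0', 'txnB.sub1']
```

Note how txnA's subtasks never interleave with txnB's in the committed order, which is exactly the atomicity guarantee -- enforced purely by ordering, without merging subtasks.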
Tell me more about the hardware, aka the Swarm architecture.
Conventional wisdom dictates that ordering limits parallelism, and ordered parallelism is indeed hard to extract -- there have been several attempts at taming this beast, including Thread-Level Speculation (TLS), as alluded to in one of the comments.
In our prior work, Swarm, we demonstrated that it is possible to extract parallelism even in the presence of ordering constraints, with a more expressive execution model, and some nifty hardware tricks.
At a high level, the Swarm hardware executes tasks speculatively, eliding ordering constraints wherever possible; only true data dependences force an ordered execution. On the hardware side, Swarm has large hardware task queues to manage tasks, plus a high-throughput commit protocol and speculation mechanisms that allow it to scale.
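As a rough illustration of this execution style, here is a toy model (with invented names; this is not the actual Swarm protocol): tasks run speculatively in an arbitrary order against a memory snapshot, but commit in timestamp order, and a task whose reads turn out stale at commit time is aborted and re-executed.

```python
# Toy model of ordered speculation (illustrative assumption, not the real
# Swarm hardware): tasks execute speculatively against a snapshot of
# memory, buffering their writes; commits happen in timestamp order, and
# a true data dependence (a stale read) forces an abort and re-execution.
mem = {"x": 0, "y": 0}

def make_task(body):
    """Wrap a task body so that it records its read and write sets."""
    def run(snapshot):
        reads, writes = {}, {}
        def rd(k):
            reads[k] = snapshot[k]
            return snapshot[k]
        def wr(k, v):
            writes[k] = v
        body(rd, wr)
        return reads, writes
    return run

# t1 (timestamp 1) must appear to run before t2 (timestamp 2).
t1 = make_task(lambda rd, wr: wr("x", 5))
t2 = make_task(lambda rd, wr: wr("y", rd("x") + 1))
tasks = {1: t1, 2: t2}

# Speculative execution: t2 runs first, against a stale snapshot.
results = {ts: tasks[ts](dict(mem)) for ts in (2, 1)}

for ts in sorted(results):              # commit in timestamp order
    reads, writes = results[ts]
    if any(mem[k] != v for k, v in reads.items()):
        # True data dependence detected: abort and re-execute the task.
        reads, writes = tasks[ts](dict(mem))
    mem.update(writes)

print(mem)  # -> {'x': 5, 'y': 6}
```

Here t2's speculative read of x was wrong (it saw 0, not 5), so only that task re-executes; t1 never stalls waiting for ordering.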
You can read more on Swarm at: http://bit.ly/swarmarch
Fractal relies on the Swarm hardware to enforce ordering guarantees while extracting parallelism. The key challenges were: how to assign a suitable order to subtasks that also maintains the atomicity of "domains" of subtasks; how to assign this order dynamically; and how to support "unbounded nesting" -- e.g., if your subtask makes a library call, that call could itself be parallelized (but must still remain atomic), and so on. Supporting unbounded nesting is essential for software composition of algorithms.
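A hedged sketch of how unbounded nesting can compose under such an ordering (the tuple timestamps and names here are illustrative assumptions, not Fractal's actual mechanism): nesting one level deeper just appends another timestamp component, and lexicographic order still keeps every level's tasks contiguous.

```python
# Illustrative only: a subtask's parallelized library call gets
# timestamps that extend the subtask's own timestamp, so the call stays
# atomic within the subtask, which stays atomic within its transaction
# -- and this composes to any nesting depth.
events = [
    ((1, 1, 0), "txnA.sub1.libcall0"),  # library call inside txnA.sub1
    ((2, 0),    "txnB.sub0"),
    ((1, 0),    "txnA.sub0"),
    ((1, 1, 1), "txnA.sub1.libcall1"),
]
events.sort()  # lexicographic timestamp order = commit order
print([name for _, name in events])
# -> ['txnA.sub0', 'txnA.sub1.libcall0', 'txnA.sub1.libcall1', 'txnB.sub0']
```

The library's internal tasks slot between their caller's neighbors without the caller (or the library author) knowing how deep the nesting goes -- which is what makes algorithms composable.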
Is Fractal / Swarm just an accelerator?
We believe that Fractal and Swarm are a more general solution to some of the hardest parallel computing problems. We have shown that the ideas presented in these works are applicable in a wide range of domains, from databases to discrete-event simulation, machine learning, and graph analytics.
When will we see this in hardware?
Currently, we evaluate our ideas in simulation. But our group is working on prototyping this on an FPGA in the coming year. So stay tuned!
1) Does this require the programmer to rewrite software or make annotations? Even if that's the optimal way, can a compiler do this automatically, at least to some extent?
2) How general a workload can this accelerate? You mention speculative multithreading and HTM in both papers, both of which are fairly general-purpose techniques.
1. We are currently working on a compiler that can take serial code and automatically annotate tasks and domains. Keep watching our webpage for updates on that.
2. Our execution model subsumes both Transactional Memory (TM) and Thread-Level Speculation (TLS). Any program that can be expressed in those paradigms can be expressed in Fractal. And as we show in the paper, Fractal can express parallelism that was not possible in TLS or TM. So we believe it is general purpose. Further, as indicated above, we have seen these ideas to be applicable to applications in a wide range of domains.
That being said, we are always on the lookout for applications that reveal deficiencies in our execution model, to help improve it :)
2. Absolutely, SpMT has been proposed as a general-purpose technique; that's why I found it odd that the benchmarks weren't from a general-purpose suite like SPECint. Do you have any SPECint numbers, or any plan to get them?
(see page 8, Table 2: Configuration of the 256-core system)
There are fairly good cycle-accurate open-source x86 simulators, such as PTLsim or MARSSx86.
Anyway, for me the rub is that, historically, it has taken a really long time for mainstream software to exploit new instruction sets or architectures.
More importantly, Dennard scaling basically dropped dead around the 40-28nm nodes, although we did get a "pseudo-continuation" in the form of FinFETs and friends.
That said, this is not an 88x speedup for general-purpose computing, just for specific parallel workloads.