Didn't read the full paper (might do it later in the day). But it seems like they would need per-task instruction counters in each core, or maybe even some sort of call-graph monitoring on chip. Otherwise, it'll be interesting to see how they distinguish between (1) a parallelized task making a kernel call and thereby causing a context switch, and (2) one of the parallelized tasks getting preempted to, say, service an interrupt. In case (2) they'd need to pause the instruction counter for the preempted task, while in case (1) they shouldn't pause the counter for the task that made the call.
Maybe they propose making the instruction count part of a task's "context". My understanding is that today's chip instruction counters are per-chip and not per-task.
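To make that idea concrete, here's a toy sketch (my own made-up model in Python, not anything from the paper) of treating the instruction count as saved/restored task state, the same way registers are on a context switch. The core's hardware counter keeps ticking; each task's context snapshots it on switch-in and gets charged the delta on switch-out:

```python
# Toy model: one per-core hardware counter, plus per-task "context"
# that snapshots it so instructions get charged to the right task.

class Task:
    def __init__(self, name):
        self.name = name
        self.instr = 0      # per-task instruction count, part of its context
        self.snapshot = 0   # hw counter value when this task was switched in

class Core:
    def __init__(self):
        self.hw_counter = 0  # monotonically increasing, per-core
        self.current = None  # task currently running on this core

    def retire(self, n):
        """Pretend the core retires n instructions."""
        self.hw_counter += n

    def switch_to(self, task):
        # Switch-out: charge the outgoing task for the slice it ran.
        if self.current is not None:
            self.current.instr += self.hw_counter - self.current.snapshot
        # Switch-in: snapshot the counter into the incoming task's context.
        task.snapshot = self.hw_counter
        self.current = task

core = Core()
a, b = Task("A"), Task("B")

core.switch_to(a)
core.retire(1000)
core.switch_to(b)    # e.g. A preempted to service an interrupt path
core.retire(250)     # these 250 instructions are charged to B, not A
core.switch_to(a)
core.retire(500)
core.switch_to(b)    # flush A's final slice

print(a.instr, b.instr)  # → 1500 250
```

The point is just that once the count lives in the task's context, the preemption case (2) falls out for free: whoever is switched in eats the instructions.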
Let's say each of 100 tasks independently has a 50% chance of stalling for 30 seconds and a 50% chance of finishing in 0.01 seconds, and each then needs access to one of 3 shared resources. If the stalls are totally random, then the requests for the resources arrive in a random order, after a random number of the fast threads have finished — so you can't predict which tasks will contend.
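A quick Monte Carlo version of that thought experiment (again my own toy model, with a tiny jitter term I added to model scheduling noise among the fast finishers):

```python
import random

def first_resource_winners(n_tasks=100, n_resources=3):
    # Each task stalls 30s or finishes compute in 0.01s, 50/50.
    # The jitter breaks ties among the fast finishers, standing in
    # for scheduler noise.
    finish = [(30.0 if random.random() < 0.5 else 0.01)
              + random.random() * 1e-4
              for _ in range(n_tasks)]
    order = sorted(range(n_tasks), key=lambda i: finish[i])
    return order[:n_resources]  # first 3 tasks to request a resource

random.seed(1)
print(first_resource_winners())
print(first_resource_winners())
```

Run it a few times and the set of tasks that reach the 3 resources first is different every trial, which is the whole problem: the contention pattern isn't something you can schedule around statically.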
Writing multi-core code does not have to be hard, but you can't abstract away the problem. You need to understand the dependency flows and code around them.