Multi-core and multi-threading performance (scalibq.wordpress.com)
67 points by pmarin on Oct 11, 2012 | 52 comments



Amdhal "The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program" was talking about single programs which at that time were generally batch processing applications. Just like today, developers tended to start with single-threaded code and then looked for ways to parallelize it.

When he says "speedup...is limited", he means that in the mathematical sense. I.e., if you had infinite processors the program would not take zero time to complete.
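
For the record, that limit is easy to see from the formula itself. A quick Python sketch (the 5% serial fraction is just an example number, not anything from the article):

    # Amdahl's law: speedup with n processors when a fraction s of the
    # runtime is inherently serial. As n grows, speedup approaches 1/s.
    def amdahl_speedup(s, n):
        return 1.0 / (s + (1.0 - s) / n)

    # With only 5% serial work, 1024 cores give ~19.6x, not 1024x,
    # and infinite cores could never do better than 20x.
    print(amdahl_speedup(0.05, 1024))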

Adding cores is still great when you have many threads and processes competing for CPU time. If your workload is blocking on network or disk, it means you have enough cores. Anyone who would expect additional cores to significantly speed up a single single-threaded application is simply in error.

If this is a myth, I don't know anyone who believes it.


> Anyone who would expect additional cores to significantly speed up a single single-threaded application is simply in error. [...] If this is a myth, I don't know anyone who believes it.

9 women can't create an infant in a month?


Ah yes, The Mythical Woman-Month.

I stand corrected. :-)


Once the pipeline is warmed up...


The point is that even the most parallel programs tend to have bottlenecks where only one thread can proceed. As you add more cores, these bottlenecks grow in importance relative to the parallel parts, which your additional parallel capacity handles easily.


Did anyone (who writes parallel code or purchases CPUs professionally) really not know this?

I'm not trying to pick on the article, I like it overall. I took an interest in the Anand article when I learned you could build a 64-core box for about $4000.

This is the article it's in response to: http://www.anandtech.com/show/5057/the-bulldozer-aftermath-d... The crux of it is:

> Some, including sources inside AMD, have blamed Global Foundries for not delivering higher clocked SKUs. Sure, the clock speed targets for Interlagos were probably closer to 3GHz instead of 2.3GHz. But that does not explain why the extra integer cores do not deliver. We were promised up to 50% higher performance thanks to the 33% extra cores, but we got 20% at the most.

So the real questions seem to be:

Is Interlagos' performance disappointing on purely parallel benchmarks as well, or just on those with a noticeable serial bottleneck?

If there is poor scalability, what is the cause?

and perhaps most importantly: Will my kind of workload actually benefit from Interlagos' additional cores?

I think what's probably happening is that even the purely parallel benchmarks are not scaling linearly relative to the previous generation of chips, that the reason is a combination of architectural changes in the core itself and memory bus contention, and that the target market for these chips is VPS hosts running hundreds of VM images.


I agree with everything you're saying; I just think the 'obvious point' is one that's papered over in public discourse a lot because it's not exciting.

It is interesting to ask why they don't scale linearly, and I agree that it's probably a combination of memory bus contention and/or the chip itself being a little slower.

The whole Bulldozer thing seemed like a marketing stunt rather than a thoughtful pursuit of higher performance to me. Given the frequency and cost of cache misses, Intel HT seems to get the same work done on fewer watts.


I really don't think AMD would have bet their next generation of chips on something they believed was a marketing stunt.

Once the upcoming chip changes get explained to the Marketing Department, however, maybe it kinda starts to sound like Marketing.


Intel HT is often disabled on HPC clusters due to performance problems. Also, many people refer to the IL (Interlagos) as an 8-core CPU due to the problems mentioned above. Much like not using HT, many codes disable half the AMD chip.


Yeah, but it's really only on highly-optimized cache-coherent vector/matrix math that it's a problem. On just about every other piece of software, it works great.

I feel like AMD saw the benchmarks available for matrix math and went for it, rather than thinking about what real-world performance profiles looked like.


> even the most parallel programs tend to have bottlenecks

There's a big class of "embarrassingly parallel" tasks, which usually are only limited by how quickly you can move your data to where it needs to be.


What makes something EP is that there is very little data (relative to the CPU time) that needs to be moved.

For example, to compute a nice image of points near the Mandelbrot set, you only need to move the requested parameters (4 floats) to the CPU, let it churn for a long time, and retrieve the resulting pixels (3 bytes each).
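
A rough Python sketch of that shape, just to illustrate the ratio of data moved to work done (the resolution, iteration cap, and function names are made up for the example):

    from functools import partial
    from multiprocessing import Pool

    MAX_ITER = 256

    def render_row(params, y):
        # Each worker gets a handful of floats plus a row index, does a lot
        # of arithmetic, and returns a single row of iteration counts.
        xmin, xmax, ymin, ymax, width, height = params
        row = []
        for x in range(width):
            c = complex(xmin + (xmax - xmin) * x / width,
                        ymin + (ymax - ymin) * y / height)
            z, i = 0j, 0
            while abs(z) <= 2.0 and i < MAX_ITER:
                z = z * z + c
                i += 1
            row.append(i)
        return row

    if __name__ == "__main__":
        params = (-2.0, 1.0, -1.2, 1.2, 800, 600)
        with Pool() as pool:
            rows = pool.map(partial(render_row, params), range(600))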


To me, something is EP if it has no or few dependencies between data, regardless of how much the data does or doesn't need to be moved around.


Yeah, that's probably more important. But they're related and the amount of data is not completely irrelevant.

Say instead of the Mandelbrot set we're talking about generating large vectors of random numbers. This task should obviously have no data dependencies and there are algorithms that can generate random numbers much faster than an external CPU bus can transfer them to memory.

If raw bandwidth limitations mean that our CPUs see only 10% utilization, is this still a "fully EP" problem? Since Mandelbrot should easily achieve full CPU utilization on almost any machine, couldn't we say that it is more EP than the RNG problem?
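
For concreteness, here's a rough Python sketch of the RNG task I have in mind (seeds and chunk sizes are arbitrary): each worker produces its chunk with no dependence on any other, so the only limit is how fast the results can be shipped back.

    import random
    from multiprocessing import Pool

    def generate_chunk(args):
        # Every worker owns an independent generator, so there are no data
        # dependencies at all -- just results flowing back over the bus.
        seed, n = args
        rng = random.Random(seed)
        return [rng.random() for _ in range(n)]

    if __name__ == "__main__":
        n_workers, per_worker = 4, 250000
        with Pool(n_workers) as pool:
            chunks = pool.map(generate_chunk,
                              [(seed, per_worker) for seed in range(n_workers)])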


I would say that the RNG problem is EP in that each data item can be processed completely in parallel with the others. The fact that there is a memory bottleneck doesn't really say anything about the nature of the problem, but rather about the hardware it's running on.


BTW, thanks for the really interesting discussion!


I think EP describes a problem, whereas how much data you have to move around is dependent on your implementation. For example, SETI@home is an EP problem that could be run on a multicore machine completely in parallel without moving data. The fact that it is run on many computers over the internet doesn't change that it's EP. In fact, it is exactly because it is EP that it can be run @home.


I agree SETI@Home is very EP. What did I say to suggest it was not?


I'm not arguing that it's EP; I'm using it as a counterexample to your point.

You can solve SETI by moving data around or without moving data around, but that doesn't change the inherent fact that it is an EP algorithm.


But do we agree that if SETI required moving so much data around that it was clearly difficult to keep the CPUs fully utilized, then it would not be an EP problem?


"the most parallel programs" are those that run on supercomputers / huge clusters, and they don't have these bottlenecks...

But for average parallel programs, you are (currently) right, I guess.


Keep in mind, https://en.wikipedia.org/wiki/Amdahl%27s_law was posed at the very beginning of parallel computing. Seymour Cray later became famous both for his fast uniprocessor supercomputer and his quip "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"

For most embarrassingly parallel problems, a Hadoop or BOINC-like cluster made from commodity hardware may always be cheapest. But "real" supercomputers and clusters will spend the most money to reduce bottlenecks precisely because they want to run those programs that are the most limited by them.


I know; I actually read the paper about a year ago. What was later turned into Amdahl's law is more or less a side note in the paper :)

The parent said "most parallel programs", which I translated to embarrassingly parallel problems.

> But "real" supercomputers and clusters will spend the most money to reduce bottlenecks precisely because they want to run those programs that are the most limited by them.

I agree, but from the talks I have heard, most time (= money) is spent trying to work around these bottlenecks as well as possible in software (e.g. incremental algorithms, pipelining, ...) to keep the machines busy. Then again, I hear a lot more talks about software than hardware, so you may well be right. If you have any particular example in mind, I would love to read about it!


My impression (formed from growing up as the annoying kid in the datacenter back when they let annoying kids hang out in the datacenter) is that buying new supercomputer hardware takes a lot of money and a really long time. Showing a return for the big investment might take much longer than the life of a typical PC.

Once the computer is actually in place (especially in a university setting), the interesting work begins: trying to write the best code for it and even improve the existing algorithms.

It seems like there are computer purchases motivated by "this machine will run our existing code faster" and those motivated by "this machine will allow us to write code for it that will prove something faster". A supercomputer seems, almost by definition, the latter.


You've still got a driver program that coordinates the massively parallel stages, most likely, and a less-scalable dispatch process where not all nodes are yet doing work.


Yes, there definitely are programs where this holds true (as I have said), but there are others where it does not. A simple but perfectly reasonable example is a brute-force search for cracking a password (if you have the hash of the password). Your dispatch is simple: send every node 1) the hash, 2) the number of nodes, 3) its own number. With 2) and 3), each node can easily figure out where it starts. You need no more communication until one of the nodes finds the password.
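
A toy Python version of that dispatch, for concreteness (SHA-256 and a lowercase four-character keyspace are arbitrary choices for the example):

    import hashlib
    from itertools import product
    from string import ascii_lowercase

    def crack(target_hash, num_nodes, my_rank, length=4):
        # Node `my_rank` of `num_nodes` only tests candidates whose index
        # modulo num_nodes equals its own rank -- no communication needed
        # until somebody finds the password.
        for i, chars in enumerate(product(ascii_lowercase, repeat=length)):
            if i % num_nodes != my_rank:
                continue
            candidate = "".join(chars)
            if hashlib.sha256(candidate.encode()).hexdigest() == target_hash:
                return candidate
        return None

    if __name__ == "__main__":
        target = hashlib.sha256(b"oxen").hexdigest()
        # in reality each rank runs on its own node; here we just loop
        print([crack(target, num_nodes=4, my_rank=r) for r in range(4)])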

See https://en.wikipedia.org/wiki/Embarrassingly_parallel

Again, you said "most parallel", and not "average parallel", and this is what I have commented on.


The megahertz myth was a result of people being conditioned to see clock speed as an absolute measure of performance.

It's more a symptom of people looking for one absolute measure of performance, because it's simply easier (less computationally expensive for the human brain) than dealing with a multi-variable valuation problem. IQ is another example: it's easier to deal with just that number than to consider the whole person. Engine displacement (e.g. a 2.2L engine) is another; there's a range of variables such as max torque and max horsepower, which engine speeds maximize those, and so on.


The biggest myth is probably that the CPU is the most important part for performance. There are so many other bottlenecks that you can easily run into (drive, memory, memory bus, ...) that blindly upgrading the CPU (more MHz / more cores) is simply wrong.


I think it's pretty incredible to see our processors finally catch up with and overtake our computing requirements. If you have a first-gen i5, or even a Core 2 Duo, your CPU is plenty fast to complete 99% of computer tasks out there without feeling any sluggishness. 10 years ago, when people would come to me for advice on purchasing a laptop/desktop, I would tell them to buy the best processor they could afford and then upgrade the rest of the components while they used it. Nowadays, most decent laptops ship with an i5 at minimum, and some people ask me if they should upgrade to the i7. I always tell them to save their money and spend it on a faster SSD or more RAM (if they have less than 4GB).


The biggest improvement on my current laptop for me is the SSD. Man, that's the biggest jump in (perceived) performance in a long while.


I've got a question that I'd love for someone to provide feedback on. Is it the case that a multicore processor must have identical cores?

Let's say there are three tasks, A, B, and C. C depends on A and B, but A is 4x as costly as B. Adding more cores doesn't get around that inherently sequential part, so because of overhead and cost considerations it isn't worth it to parallelize. This applies to most computational tasks: as you add more cores, the effect of the sequential part dominates more and more and you get decreasing benefits.

But what if you have one beefy processor to deal with the heavily sequential parts, and a multitude of less impressive ones to deal with the easily parallelizable parts? Doesn't that give you the best of both worlds?

Is it technically infeasible for some reason?


It is quite feasible. Historically, one could have multiple, different CPUs in one system: http://en.wikipedia.org/wiki/Asymmetric_multiprocessing

Nowadays, those CPUs can be on one chip: http://en.wikipedia.org/wiki/Heterogeneous_computing, http://en.wikibooks.org/wiki/Microprocessor_Design/Multi-Cor....

Also, if one builds an embedded CPU from building blocks, it is easy to put in small CPUs that perform tasks completely independent of the 'real' CPU. For example, there could be a tiny CPU that controls battery charging or tracks GPS. I do not know whether this really is done, though; it could be more convenient to use off-the-shelf components for that, because they are more suited for the job or because pins are typically in too great demand.


Sounds like AMD may be planning to integrate an ARM core on-die: http://semiaccurate.com/2011/06/22/amd-and-arm-join-forces-a...


Actually, newer server CPUs have that, in a way. Each core is technically identical, but the performance is not always identical and can be dynamically adjusted.

The most obvious example is dynamic overclocking, where one core (that ideally works on a sequential part) gets a higher frequency (the other cores are usually underclocked at the same time). Intel calls this Turbo Boost: http://en.wikipedia.org/wiki/Intel_Turbo_Boost

In Hyper-Threading-enabled cores, you also have varying performance per core, i.e. if a core hosts only one active CPU-bound thread, execution is faster than if one HT core hosts two CPU-bound threads.

So, it is technically feasible, but at the same time makes scheduling for the OS harder.

Edit: And it also makes app performance less predictable. And it's one of the reasons why you need to run serious benchmarks more than a couple of times.


Yes, that is feasible and there are real systems that take that approach (millions of them in fact - look up the architecture of the Sony Playstation 3 console if you're curious).

One problem that surfaces is that such asymmetric systems are harder to write software for, which can make them a bit unpopular.


What do you mean by a "less beefy CPU"? Once you've been able to design the new CPU and test it, it doesn't make sense to keep an older/slower design. You could reduce the cache, but it would probably be more expensive to build.

If one of your cores is idle, the CPU will load it if there is work to be done.


I think what scarmig is getting at is: wouldn't it be better to have one 2GHz core and three 1GHz cores, rather than four 1.5GHz cores?

I don't know the answer myself, and would be interested if someone could clarify why Intel doesn't make quad cores with one core faster than the others.

My best guess is that it's hard to keep things in sync if you have different cores on different frequencies or multipliers.
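
To put rough numbers on that question (a back-of-the-envelope sketch, assuming the serial part runs on the fastest core and the parallel part spreads perfectly over all of them, both idealizations):

    # Runtime of one unit of work, split into a serial fraction s that runs
    # on the fastest core and a parallel remainder spread over all cores.
    def runtime(s, ghz_per_core):
        return s / max(ghz_per_core) + (1 - s) / sum(ghz_per_core)

    for s in (0.05, 0.20, 0.50):
        asym = runtime(s, [2.0, 1.0, 1.0, 1.0])   # one fast core, three slow
        sym = runtime(s, [1.5, 1.5, 1.5, 1.5])    # four equal cores
        print("serial %.0f%%: asymmetric %.3f vs symmetric %.3f"
              % (s * 100, asym, sym))

Under those assumptions the asymmetric mix only pulls ahead once the serial fraction gets fairly large; for mostly parallel workloads the symmetric chip's higher total throughput wins, which is roughly the trade-off Turbo Boost tries to straddle dynamically.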


There is an untold story involving software licenses. Much software is licensed to run on certain PE (CPU core) counts. A 1024-PE license costs you 64 times as much as a single-host license. Moreover, the software costs may easily exceed the core costs.

In these cases the bottleneck is software fees, and the easiest solution is to use higher-quality cores.


tl;dr Use the right tool for the job. You need to know your tool (CPU) and your job (application(s)).

I.e., an app which is heavily affected by Amdahl's law doesn't really profit from more cores, whereas highly parallel apps profit a lot (see the reference to the Niagara architecture at the end).

What I find strange is that the article first introduces the difference between multitasking and multithreading in a rather lengthy manner, and then uses Amdahl's law, which only applies to multithreading. In particular, there is a gap between what is stated first:

> So multitasking means you are using multiple applications at the same time. Which you always do, these days.

and the conclusion:

> With multi-core, your mileage may vary, depending on the algorithms used by the application (to what extent are they parallelized, and to what extent do they have to remain sequential?), and on the performance of the individual cores.

His discussion is all well and good when you have one app that really needs the power and the other multitasking apps don't need much CPU (e.g. playing an mp3 and such), which is the case for most desktop systems, I'd say. For servers it can be really different: you could either have many processes running, or one process that is heavily optimized for multithreading and avoids Amdahl's law nicely. Then the discussion is kind of moot.


You don't avoid Amdahl's law, nicely or otherwise, for those problems to which it applies. It's a mathematical thing, not a technical thing.

Problems that are split up over many cores, processes or even clusters of machines always have some serial component, and that serial component determines the minimum run-time for the whole computation.

So even if all the rest of the work were reduced to '0', there would still be that serial component left, whatever it was.

This is not at all about playing mp3s or server loads such as web serving that are embarrassingly parallel and that can be run in a shared nothing environment. It is all about computing solutions to problems where there is some shared state between the various parts of the computation.

In computations like that, the serial bits are visible as bottlenecks in the data flow, either because there is a synchronization point and some kind of reduction of the data that can only be done in a serial fashion, or an ordering or some other operation that can only be done by a limited number of the total available cores. Those sections, and only those sections, will put a lower limit on how fast you can solve that particular problem for a given data set.


You are right, my wording was incorrect. I'll try to put what I wanted to say in the words of Dr. Thomas Puzak, IBM: "Everyone knows Amdahl's law, but quickly forgets it".

That said, Amdahl's law does NOT apply to problems, it applies to programs: https://en.wikipedia.org/wiki/Amdahl%27s_law What I was trying to say is that, from my experience, if you take an existing program and try to parallelize it with minimal effort (e.g. parallelize all loops), then you may end up with a serious percentage of serial code. (Or you don't, because your data is large enough; see Gustafson's law: https://en.wikipedia.org/wiki/Gustafson%27s_Law ) However, if you put in more effort and re-architect the program, you'll usually worry about a lot of problems, but Amdahl's law is not one of them. Yes, you didn't "avoid" Amdahl's law (and you can find counter-examples where it is a real concern).

I actually heard a talk about ordering in in-memory databases. It basically comes down to Gustafson's law: if your dataset is rather small, the overhead of parallelization (some of which can be attributed to Amdahl's law) isn't worth it, so you just do it on one core. But because your dataset is small, it is so fast that nobody cares. If your dataset is large, each core sorts a subset. Then you start aggregating, and the final aggregation is serial. However, because you already have sorted lists, you can use a different algorithm that merges them very quickly (but would perform badly on unsorted data). I can't remember any specific numbers, but the bottom line was that even with a very beefy server that had 4 sockets and plenty of cores, the serial part wasn't an issue in terms of overall performance. Keep in mind that the author talked about a Multi-core myth and not a Many-core myth.
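
A minimal sketch of that sort-then-merge pattern (plain Python, with multiprocessing and heapq standing in for whatever the database actually does internally):

    import heapq
    import random
    from multiprocessing import Pool

    def sort_chunk(chunk):
        # Parallel phase: each core sorts its own subset independently.
        return sorted(chunk)

    if __name__ == "__main__":
        data = [random.random() for _ in range(1000000)]
        n_workers = 4
        chunks = [data[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            runs = pool.map(sort_chunk, chunks)
        # Serial phase: a cheap k-way merge of the already-sorted runs.
        result = list(heapq.merge(*runs))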

Again, I don't argue that near-linear speedup is possible for most problems. Most often it's not. But naming Amdahl's law as the only reason is wrong.


Very good points. I remember playing Need for Speed 2 SE on a Pentium II 200MHz with an mp3 playing in the background with no problems at all. These days we can emulate several of those machines inside a single OS, while browsing the web and watching a movie at the same time.


[deleted]


It probably has to do with the desire to keep gameplay very consistent across all users' hardware. They don't want to give a huge advantage to the enthusiast with a dozen cores over the person playing on their laptop.

But making the graphics better is something that doesn't seem to help your score much. More realistic fog may even make the game a little harder. So it's probably no coincidence that GPUs are leading the way in parallelism.


Sorry, if I realized you were replying, I wouldn't have deleted :p

While videogames have been one of the main economic forces pushing processor technology, I've heard that few games can actually use more than 1 CPU core. Is this because of a difference in the way games are programmed? Is it because single-core programming is (I've heard) so much easier, and games are more often written by aspiring novices than professional drones?


That's not true, at least for AAA games. Most modern AAA games make good use of all available processors. This is especially important for games that ship on the Xbox 360 and PS3 because these devices only start delivering impressive results when you go parallel.

It's difficult to come up with a good multicore game engine design and even more difficult to do it on the first try. What I've observed over the years is that these AAA games are steadily becoming more and more parallelizable as they are iterated on.

An architectural pattern that seems to be working well these days is a job/task-based one, where the program is expressed as a graph in which tasks (well-defined units of work) are connected via a network of dependencies. This model works well because it doesn't force you into requiring certain hardware configurations; it scales well from single-core phones up to 32+ core workstations. Now, whether or not your content can scale as well is another question :)
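
A very stripped-down sketch of that idea (Python threads and invented names; a real engine tracks readiness far more cleverly and overlaps waves of work):

    from concurrent.futures import ThreadPoolExecutor

    class Task:
        def __init__(self, name, fn, deps=()):
            self.name, self.fn, self.deps = name, fn, list(deps)

    def run_graph(tasks, workers=8):
        # Wave-by-wave scheduler: submit every task whose dependencies are
        # done, wait for that wave, repeat. Independent tasks run in parallel.
        done, pending = set(), list(tasks)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while pending:
                ready = [t for t in pending if all(d in done for d in t.deps)]
                if not ready:
                    raise ValueError("dependency cycle")
                futures = {t.name: pool.submit(t.fn) for t in ready}
                for t in ready:
                    futures[t.name].result()
                    done.add(t.name)
                    pending.remove(t)

    # e.g. animation and physics can run in parallel; rendering waits for both
    run_graph([
        Task("animation", lambda: None),
        Task("physics", lambda: None),
        Task("render", lambda: None, deps=["animation", "physics"]),
    ])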


That's one confusing part: videogames are supposed to be very difficult to parallelize on the CPU but MUST be parallelized on the GPU. Is this because the CPU is usually running more complex code with lots of branches and (for multiplayer) coordinating through network layers, while the GPU is basically just transforming arrays of numbers into arrays of other numbers? (Arrays which monitors convert to photons for us to see.)


Yep. Really nice-looking graphics can be done with computations that parallelize nicely over individual triangle vertices and individual pixels.

The gameplay logic and networking code for most games could probably be optimized for more cores than it usually is. But again, the primary consideration is that it plays well for everybody.


I liked this blog entry a lot. A wee bit of multi-core/threads anti-hype is healthy.

The comment about people wanting a single absolute measure of something is also spot on.

The biggest bottlenecks to me were (and still are) memory and the network. Multi-core is nice, but given the choice I'd still trade additional CPUs for more memory or a faster network.


Nice little shoutout to the Amiga there, and why it was such a groundbreaking machine!

Nothing else a home user could buy could do pre-emptive multitasking in 1985... Or even 1995 AFAIK.


There was the 'unicorn'.

http://en.wikipedia.org/wiki/Torch_Computers

based on a 68010.


Well, Unix would do it.

Honestly never heard of these though, thanks.


Very interesting read. So if clock speed (when analyzing different architectures) doesn't matter, what should I be looking at?





