The way I see it, before C++11 the semantics of individual threading implementations relied on compiler- and platform-specific primitives (like atomics and barriers) that happened to be similar enough to create the illusion that C++ supported multithreading as a language.
C++, for all intents and purposes, supported multithreading before C++11 via extensions and libraries that were vaguely similar to each other on different platforms, but you could never be sure that your multithreaded code would work when ported to another platform (compare the "volatile" keyword, which carries no ordering guarantees in standard C++ but historically gained acquire/release semantics as an MSVC extension).
But now,
1. it's officially mandated, and
2. you have more tools in your concurrency toolbox (for example, compiler-independent atomic variables with the ability to fine-tune memory order).
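To make (2) concrete, here's a minimal sketch in C++11 (the names and values are mine, purely illustrative) of the standard release/acquire handoff that previously required compiler-specific extensions:

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                  // plain, non-atomic write
        ready.store(true, std::memory_order_release);  // publish: earlier writes can't move past this
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // pairs with the release store above
            ;                                          // spin until the flag is set
        assert(payload == 42);                         // now guaranteed by the standard itself
    }

    int main() {
        std::thread t(producer);
        consumer();
        t.join();
    }

The same code compiles and behaves identically on every conforming compiler; that's the part that's genuinely new.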
Portable thread libraries have existed since userspace threads were invented. That's really not news. But it's true that the C memory model (per the spec) was underconstrained. So various tricks like volatile had to be used, and when those weren't enough the libraries had to drop down to platform-specific assembly or intrinsics: memory barriers and the like.
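For illustration, this is roughly what that drop-down looked like: the same full fence, a different spelling per toolchain. MemoryBarrier() and __sync_synchronize() are real intrinsics; FULL_BARRIER is just a name made up for this sketch.

    // One full memory fence, spelled differently per compiler.
    #if defined(_MSC_VER)
      #include <windows.h>
      #define FULL_BARRIER() MemoryBarrier()        // Win32 full-fence macro
    #elif defined(__GNUC__)
      #define FULL_BARRIER() __sync_synchronize()   // legacy GCC full-barrier builtin
    #else
      #error "no known barrier for this toolchain"
    #endif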
The point here is that they're just putting this into the standard and requiring specific semantics. This works because all CPU vendors have settled on a more or less consistent way of doing this -- synchronization is mature technology, basically. But it presumably also means that some older uniprocessor architectures won't be officially supported.
Portable thread libraries worked very well for people who were willing to use mutexes, which are implemented to include the memory barriers required to reason about code. Where it got nasty was when people couldn't afford the performance hit of mutexes (or imagined they couldn't, or just wanted to be ninjas) and tried to write safe code without mutexes by reasoning about how execution might get interleaved. C++ didn't provide as many guarantees about the ordering of side effects as a reasonable person would guess, so that typically didn't work.
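A hypothetical sketch of the kind of mutex-free reasoning I mean; this compiled fine and often appeared to work:

    // Plain variables, no mutex, no atomics. Nothing stops the compiler
    // or CPU from reordering the two stores, or from hoisting the flag
    // read out of the loop entirely.
    int payload = 0;
    bool ready = false;

    void producer() {
        payload = 42;
        ready = true;    // can become visible before the payload write
    }

    void consumer() {
        while (!ready)   // legally optimizable into an infinite loop
            ;
        // payload may still read as 0 here; in C++11 terms this is a
        // data race, i.e. undefined behavior
    }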
Wise people threw up their hands and said, "I'm not a ninja, so I'll just use mutexes and coarse-grained locking, nothing tricky, and see if it's fast enough," and they were mostly fine. I think that's what most wise people will continue to do, and the people who will get the most use out of the new standard will be people who have very special performance requirements or who write high-performance concurrent libraries (such as container libraries) for mere mortals to use.
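Something like this, in C++11 terms (EventLog is a made-up class, just to show the shape of the approach):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    // One lock around the whole structure. std::mutex lock/unlock already
    // implies the barriers needed for correctness; no cleverness required.
    class EventLog {
        std::mutex m;
        std::vector<int> events;
    public:
        void record(int e) {
            std::lock_guard<std::mutex> g(m);
            events.push_back(e);
        }
        std::size_t count() {
            std::lock_guard<std::mutex> g(m);
            return events.size();
        }
    };

Coarse, boring, and correct; measure before reaching for anything fancier.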
I don't see why older platforms can't be supported should the compiler vendor decide it's worth the trouble. Any architecture can be made coherent if you sacrifice performance. The compiler can conform to the standard by being very conservative. Just a question of cost.
1. ABI is a set of rules for interaction between multiple code units (functions, translation units, libraries) running sequentially and calling each other, e.g. how arguments are passed to the functions, how return values are passed back, how exceptions are thrown and handled.
2. Memory model is a set of rules for interaction between multiple (potentially) concurrently running threads, e.g. how a compiler can reorder load/store instructions to main memory, which operations imply memory barriers.
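To illustrate (2), a toy example of the kind of reordering the memory model does or doesn't permit:

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void writer() {
        x.store(1, std::memory_order_relaxed);  // relaxed: no ordering relationship with...
        y.store(1, std::memory_order_relaxed);  // ...this store, as seen from other threads
    }

    void reader() {
        int a = y.load(std::memory_order_relaxed);
        int b = x.load(std::memory_order_relaxed);
        // a == 1 && b == 0 is a permitted outcome under relaxed ordering;
        // release on the y store plus acquire on the y load would forbid it
    }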
IA64 defined an official ABI for C++ (the Itanium C++ ABI), and other platforms have been adopting it (though that means breaking with any previous ABI), but the ABI remains outside the scope of the C++ standard.
I hate to say it, but is this the only lasting contribution IA64 will make to the computing industry? (I did work on IA64 system software, pre-silicon, once upon a time.)
IA64 is by far the most aggressive VLIW implementation we have. It was significantly ahead of its time, in that it relied on "sufficiently advanced compilers" that still haven't materialized. We'll get there eventually, and future architectures will push complexity into the compiler the way Itanium did when the compiler is able to bear the load. We're already there for easy-to-schedule, well-understood codes (e.g., ATI GPUs are VLIW), but I'm betting more Itanium-like microarchitecture will come out of the woodwork eventually.
In a sense, VLIW/EPIC was one of the few things IA64 was trying to do differently from existing architectures. The chips have a bunch of cool RAS features, but Power5-7 have most of those too.