What is needed is a set of concurrently usable STL containers (e.g. ones that handle iterator invalidation when another thread inserts into a map), akin to the java.util.concurrent library. That way you separate the tricky concurrency work away from the people who probably do not understand it.
The limitation I'm thinking of is that it's very common to need to do more than one operation atomically, and common concurrent containers cannot do that. You can atomically add or remove a single item from a concurrent queue, stack, or hash map, and sometimes that's enough. But if you want to iterate over a whole container, say to find all items that meet some criterion, you need a lock.
It's also a huge relief to have deterministic deallocation back in my life. Praise Jesus for that.
It's not everyone's cup of tea, but if you need the speed, it's there for you, with a solid, clean, and expressive implementation. Your classic symbolic-regression genetic algorithm with halls of fame and whatnot is an absolute cinch, for example, and runs gob-smackingly fast with -O3 -march=native and a good ten minutes of profiling data for PGO.
How does it compare with using lock-free / immutable data structures? IMO this strategy is less error prone and easier to test. I've written multi-threaded C++, but mostly before C++11.
For other applications, where it only matters that changes eventually propagate through, a relaxed memory model is sufficient. Think of adding elements to a vector (which could be serialized in multiple ways anyway), or performing computation on data that is implicitly shared between cores but not actually read by the other core during the operation. Relaxed ordering is also faster and more power efficient, as high-speed core interconnects are a major drain compared to, let's say, the ALU in a modern CPU.
Lock-free structures are useful, but as your data structure gets more complicated and the updates involve more operations, writing a lock-free version of your operation becomes a very non-trivial task. It's easier to simply lock and perform an 'atomic' update. Immutable structures often have performance overheads that are not desired in the C++ STL.
Also, there may be times in your program when you know that your thread has exclusive access to the variable during a certain segment. During that segment it may be beneficial to do lazy (relaxed) reads/writes, then complete the critical section with a release operation.
This is used heavily in single producer single consumer queues.