The biggest sin generally is to disregard the overhead caused by thread management and to assume that more threads make the software run faster.
I've seen people try to parallelize a sequential program by just sprinkling mutexes everywhere and then assuming that any number of threads can do whatever they please. Of course, when tested, the system was quite a bit slower than when it ran on a single thread (the system was quite large, so a lot of work was needed before reaching this state).
Gah, no. Userspace spinlocks are the deepest of voodoo and something to be used only by people who know exactly what they are doing and would have no difficulty writing "traditional" threaded code in a C/C++ environment. Among other problems: what happens when the thread holding the spinlock gets preempted and something else runs on the core? How can you prevent that from happening, and how does that collision probability scale with thread count and lock behavior?
Traditional locking (e.g. pthread mutexes, Windows critical sections) in a shared memory environment can be done with atomic operations only for the uncontended case, and will fall back to the kernel to provide blocking/wakeup in a clean and scalable way. Use that. Don't go further unless you're trying to do full-system optimization on known hardware and have a team full of benchmark analysis experts to support the effort.
If you've ever used pthread_mutex with glibc, then you've used spinlocks without knowing it. The implementation spins for some time before going for a full kernel wait.
"Mutex" on the other hand might have a fast-path that spins a few times before inserting the thread onto a wait list.
The difference is when there's contention. A spinlock will burn CPU cycles but a mutex will yield to another thread or process (with some context switch overhead).
A spinlock should only be used when you know you're going to get it in the next microsecond or so. Or in kernel space when you don't have other options (e.g. interrupt handler). Anything else is just burning CPU cycles for nothing.
Mutex and condition variables (emphasis on the latter) are much more useful than spinlocks and atomics for general multithreaded programming.
Outside of hard realtime code, there's zero reason to use spin locks.
Interesting read: http://www2.rdrop.com/~paulmck/realtime/SMPembedded.2006.10....
That is just not true. You _must_ use them in the case where the kernel is non-preemptible. Additionally, if the locked resource is held for a very short time, a spin lock is likely a more efficient choice than a traditional mutex.
On some common architectures, releasing a spin lock is cheaper than releasing a mutex.
But if you don’t have a guarantee the lock owner won’t be preempted, well, spinning for a whole timeslice is quite a bit more expensive…
You'd better be careful with spinlocks and priorities here, as you can livelock forever.
That said, I'm used to situations where we're pinning thread affinity to specific cores and really trying to squeeze what we can out of fixed resources.
Setting CPU affinity will ensure that you always get the same core, but it might not increase performance and could adversely affect other parts of the system.
CPU affinity is a good fit for continuously running things like audio processing or game physics or similar. It's not good when threads are blocked or react to external events.
In most cases it's just unnecessary because the kernel is pretty good in keeping threads on cores anyway.
Be careful with that. First off, what people refer to as "mutex" is usually a spinlock that falls back to a kernel wait queue when the spin count is exceeded. There are even adaptive mutexes that figure out at runtime how long the lock is typically held and base their spin count limit on that.
Secondly, busy-waiting is often worse than a single slow program, because you actively slow down all of the other running programs.
Qt does it with signals and slots. What I generally do is keep a queue of std::function and just pass lambdas with the captures copied.