On modern CPUs, atomic adds are reasonably fast, but only when they are uncontended. If the cache line holding the value has to bounce between CPUs, each operation usually costs an extra 100 ns or so (nanoseconds, not cycles).
Writing performant parallel code always means absolutely minimizing communication between threads.
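
To make that concrete, here is a rough sketch in Rust of the same count done two ways. The names (`contended_count`, `batched_count`, `THREADS`, `ITERS`) are made up for illustration, and the thread and iteration counts are arbitrary. In the first version every thread hits one shared atomic on every iteration, so the cache line keeps moving between cores. In the second each thread counts privately and touches the shared atomic exactly once.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Arbitrary illustrative sizes, not tuned for any particular machine.
const THREADS: u64 = 4;
const ITERS: u64 = 1_000_000;

/// Every thread does a fetch_add on the same atomic for every iteration,
/// so the cache line holding `total` is contended the whole time.
fn contended_count() -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                for _ in 0..ITERS {
                    total.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    // All spawned threads have been joined when the scope ends.
    total.load(Ordering::Relaxed)
}

/// Each thread counts in a private local variable and publishes its
/// result with a single fetch_add at the end: one shared write per thread.
fn batched_count() -> u64 {
    let total = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..THREADS {
            s.spawn(|| {
                let mut local = 0u64;
                for _ in 0..ITERS {
                    local += 1;
                }
                total.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

Both functions return the same total. If you call each from `main` and time it with `std::time::Instant`, the batched version should come out well ahead on a multi-core machine, though the exact gap depends on the hardware, and the compiler may collapse the trivial local loop entirely.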