(There's some work being done to parallelize the rendering pipeline, and a bit on painting, but parallel CSS layout seems to be completely off the table in existing browser implementations.)
The issues I've run into are basically endemic to every large C++ application I've seen, actually: shared memory accessed via pointers from many different places, lazily computed values, shared global caches (or even just per-page caches, in the case of intra-page parallelism).
These are all reasonable things to do, so I'm hesitant to call them "incorrect", but if you want to design with parallelism in mind you have to either avoid them or design around them carefully: keep contention on shared resources low, protect access to mutable shared state with some sort of serialization mechanism, and so on.
As a simple example, last I looked WebKit's CSS selector matching algorithm cached some state on nodes as it went (as does Gecko's). That means that, as things stand, you can't do selector matching on multiple nodes in parallel, because you can get data races. This one instance can be worked around by deferring the setting of the cached bits until a serialization point, but you have to make sure you do this for every single algorithm that touches non-const members on nodes you want to parallelize... And C++ doesn't really give you a good way to find all such accesses, given "mutable" and const_cast.
Another example: When WebKit does layout-related things, it directly touches the DOM from the layout code, which is a perfectly reasonable thing to do when you think about it. But it does mean that it can't do layout in parallel with running JS that can modify the DOM. For that you need to have layout operating on an immutable snapshot of the DOM.
As far as more info.... https://github.com/mozilla/servo/wiki/Design is perhaps a start for what we think we know about how this _ought_ to work. Maybe. ;)
In theory that's possible, but in practice it would likely take just as long, and at the end you would have a rendering engine that was years behind the competition. Not least because intermediate stages wouldn't be upliftable back into the main codebase: in many cases, partially parallelizing things leaves you with worse performance than parallelizing either nothing or everything.