That is certainly a different way to think about threading. My instinct would be to keep a separate JIT cache for each thread, keeping the threads from stealing resources from each other but also keeping them from sharing the JIT effort.
I think your approach is pretty much how /lib/ld-linux.so.2 deals with the PLT. I didn't see an explanation for stub 3, but I guess you mean the part of the JIT output that does the indirect jump. I take it that you believe the speed advantage of a direct jump is not enough to overcome the occasional hit caused by modifying it, with all the invalidations (icache, trace cache, etc.) that it might cause. Being similar to the dynamic linker probably aligns well with what Intel is trying to optimize for in upcoming chips. OTOH, the Spectre problem might limit what Intel does.
Putting the JIT in a separate process would have to add latency. Upon hitting a missing chunk of code, translation can't really wait. The idea is to run modern stuff, such as a recent desktop OS, via the JIT. We do use more than one mapping on Linux, as required to keep SELinux happy.
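To make the PLT comparison concrete, here is a minimal sketch of lazy binding through a slot table, written in Python purely for illustration. Everything in it (`translate`, `DispatchTable`, the fake "host block") is hypothetical and is not taken from either poster's JIT. The point it shows is only that the call site always goes through the slot, and the one thing that ever changes is the slot's contents (data), never the emitted code.

```python
from typing import Callable, Dict


def translate(guest_addr: int) -> Callable[[], int]:
    """Hypothetical translator: pretend to JIT the block at guest_addr."""
    def host_block() -> int:
        return guest_addr * 2  # placeholder for the block's real effect
    return host_block


class DispatchTable:
    """Indirect-jump slots, loosely analogous to the GOT entries behind a PLT."""

    def __init__(self) -> None:
        self.slots: Dict[int, Callable[[], int]] = {}
        self.translations = 0  # counts how often the slow path ran

    def _make_resolver(self, guest_addr: int) -> Callable[[], int]:
        def resolver() -> int:
            # Slow path: translate once, then patch the slot (a data write).
            target = translate(guest_addr)
            self.translations += 1
            self.slots[guest_addr] = target
            return target()
        return resolver

    def call(self, guest_addr: int) -> int:
        # The "emitted code" always jumps through the slot and is never rewritten.
        slot = self.slots.get(guest_addr)
        if slot is None:
            slot = self._make_resolver(guest_addr)
            self.slots[guest_addr] = slot
        return slot()


table = DispatchTable()
first = table.call(21)   # goes through the resolver and patches the slot
second = table.call(21)  # jumps straight to the translated block
```

A direct jump would save the indirection on every call, but patching it means writing to code rather than to a table slot, which is where the invalidation cost discussed above comes from.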
With regard to latency, if you have a dedicated JIT thread, you can read ahead and JIT the start of every jump/call target up to the next ret, in parallel with executing the first chunk. Like a super fancy prefetch. Heuristically, I think it's safe to assume that most code is local, but I could be wrong.
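A rough sketch of that read-ahead idea, again in Python just to show the shape of it. The guest program (`GUEST_BLOCKS`), the `PrefetchingJit` class, and the string "translations" are all made up for the example, and it only prefetches the direct successors of the current block rather than scanning all the way to the next ret.

```python
import queue
import threading
from typing import Dict, List

# Hypothetical guest program: block address -> addresses it jumps to or calls.
GUEST_BLOCKS: Dict[int, List[int]] = {
    0x1000: [0x1010, 0x2000],
    0x1010: [0x1020],
    0x1020: [],
    0x2000: [],
}


class PrefetchingJit:
    def __init__(self) -> None:
        self.cache: Dict[int, str] = {}
        self.lock = threading.Lock()
        self.work: queue.Queue = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _translate(self, addr: int) -> str:
        return f"host code for {addr:#x}"  # stand-in for real code generation

    def _ensure(self, addr: int) -> None:
        # Translating under the lock keeps the sketch simple; a real JIT
        # would not want the executor blocked behind a prefetch.
        with self.lock:
            if addr not in self.cache:
                self.cache[addr] = self._translate(addr)

    def _run(self) -> None:
        # Dedicated JIT thread: translate queued targets in the background.
        while True:
            addr = self.work.get()
            try:
                self._ensure(addr)
            finally:
                self.work.task_done()

    def execute(self, addr: int) -> str:
        self._ensure(addr)  # a miss here is the synchronous, latency-visible path
        for target in GUEST_BLOCKS[addr]:
            self.work.put(target)  # hand likely successors to the JIT thread
        with self.lock:
            return self.cache[addr]

    def drain(self) -> None:
        self.work.join()  # wait until every queued prefetch has been handled


jit = PrefetchingJit()
jit.execute(0x1000)
jit.drain()
```

After `drain()`, the blocks at 0x1010 and 0x2000 are already in the cache, so executing either of them next would not hit the slow path. Whether that pays off depends on the locality guess above.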
Also, if you use hwloc on x86-64, you can read your CPU topology at initialization time. If hyperthreading is present (which is somewhat common), you can set the affinity of each JIT thread so that it runs on the same physical core as the emulated CPU it serves. This will minimize the overhead of concurrent reads, since you'll be reading/writing to the physical core's L1 almost every time. (This part is crazy, but you might even be able to get away without using atomics because sibling hyperthreads share that L1, but I've never tried it and it might not be guaranteed into the future.)
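For what it's worth, the sibling lookup can be sketched without hwloc on Linux by reading sysfs instead. This is a swapped-in technique, not hwloc's API. The helper names (`parse_cpu_list`, `siblings_of`, `pin_current_thread_to_siblings`) are invented for the example, and it assumes the usual `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` file, falling back to the CPU itself when that file isn't there.

```python
import os
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a sysfs CPU list such as '0,4' or '0-3,8' into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def siblings_of(cpu: int) -> Set[int]:
    """Return the hardware threads sharing a core with `cpu` (including itself)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    try:
        with open(path) as f:
            found = parse_cpu_list(f.read())
    except (OSError, ValueError):
        found = set()
    return found or {cpu}  # assumption: no topology info means no siblings


def pin_current_thread_to_siblings(cpu: int) -> Set[int]:
    """Restrict the calling thread to the core that hosts `cpu`, if possible."""
    target = siblings_of(cpu)
    if hasattr(os, "sched_setaffinity"):  # Linux only
        usable = target & os.sched_getaffinity(0)
        if usable:
            os.sched_setaffinity(0, usable)  # pid 0 means the calling thread
    return target
```

The JIT thread would call `pin_current_thread_to_siblings(n)` once at startup, where `n` is the CPU the emulated-CPU thread is pinned to. This says nothing about whether skipping atomics is safe; it only gets the two threads onto the same core.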