I wonder whether it would be possible to optimise the Python interpreter to make deep copies copy-on-write. I suppose that would involve a lot of work for relatively little gain.
I remember that being mentioned in a PEP somewhere, but it never got implemented. It might be worth reimplementing the copy module in C with copy-on-write to bring some of those benchmark numbers down.
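
For what it's worth, here is a rough pure-Python sketch of the copy-on-write idea, just to make it concrete. It is not how the interpreter or the copy module works today; `CowList` and its `_ensure_owned` helper are hypothetical names made up for illustration, and it only handles a list with index access.

```python
import copy


class CowList:
    """Hypothetical copy-on-write wrapper around a list (illustration only)."""

    def __init__(self, data):
        self._data = data      # shared with the source until the first write
        self._owned = False    # becomes True once we hold a private deep copy

    def _ensure_owned(self):
        # Pay for the deep copy lazily, and only once.
        if not self._owned:
            self._data = copy.deepcopy(self._data)
            self._owned = True

    def __getitem__(self, index):
        return self._data[index]   # reads never trigger a copy

    def __setitem__(self, index, value):
        self._ensure_owned()       # first write detaches us from the source
        self._data[index] = value

    def __len__(self):
        return len(self._data)


original = [[1, 2], [3, 4]]
lazy = CowList(original)

# Before any write, reads go straight to the shared data.
shared_before_write = lazy[0] is original[0]

# The first write triggers the deep copy, so the original is left alone.
lazy[0] = [9, 9]
```

The sketch also shows why doing this properly would be a lot of work: a read hands back the shared inner list, so mutating that list in place (or mutating `original` before the first write) bypasses the wrapper entirely. Catching those cases would presumably need support inside the interpreter rather than a wrapper class.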