This is exactly why rewrites fail. The challenge of a rewrite is not in mapping the core architecture and core use case; it's mapping all the edge cases and covering all the end-user needs. You need people intimately familiar with the old system to make sure the new system does all the weird stuff in the corners of the old system's code that nobody understood but that existed for good reasons. IMHO the best way to approach a rewrite is to form a blended team of experts on the old system and experts on the new technology, and put a manager in charge with excellent people skills who can get them to work together.
More often than not, the next developer using the old system worked around it, but never documented it as a bug (or wrong format: "it's a feature, not a bug!").
Or did your system actually include the OS source?
E.g. I'm working on a system that I could hypothetically abstract (I've got access to it, can poke with enough tests and test data, etc).
However. What I don't have is access to code or test injection into any of my sources / consumers. Both of which are expecting all the corner case quirks to be exactly identical and may actually have accreted software that depends on a specific quirk. A specific quirk that I have no way of knowing about. Or may send me something I've never seen and am not expecting because it's a 1:1,000,000 corner case, and we don't have any logging of an example that came through production.
I haven't worked on too much of the heavyweight stuff like you have, but I tend to take the perspective that "a 100% compatible rewrite is impossible." 95%+ maybe, but we're going to have to deal with the <= 5% after it goes to production.
Did you ever pursue writing and dropping a new tailored load balancer / router type application on the incoming data stream such that you could divert a specific portion onto new system(s)?
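For what it's worth, the diverting router described above can be sketched in a few lines. Everything here is hypothetical (the backend names, the percentage-based rule); the key idea is to route deterministically by a message key, so the same stream of related traffic always lands on the same system:

```python
import hashlib

def pick_backend(message_key: str, divert_percent: float) -> str:
    """Deterministically route a message to the legacy or the new system.

    The same key always maps to the same backend, so any quirk-dependent
    conversation stays on one system rather than flip-flopping between them.
    """
    # Hash the key into a bucket in [0, 100) and compare it to the threshold.
    digest = hashlib.sha256(message_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000 / 100.0
    return "new_system" if bucket < divert_percent else "legacy_system"
```

You start at 0% (everything stays on legacy), then raise the percentage as confidence in the new system grows; anything that diverges gets pinned back to legacy while you investigate.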
If it's something that's never come through production, it is not already documented anywhere, and it has an occurrence rate of approximately 0.001%, is it really a feature that needs to be replicated?
(Currently migrating a site of 125k pages of content with oodles of edge-cases)
Also, too much of the talent is stuck in a blocking-I/O mindset somehow.
Some are wizards though, writing assembly and making Raspberry Pi-sized systems blazingly fast. OK, a couple of Raspberry Pis.
Cobol isn't involved, but those slow mainframes are.
We're not running our modeling engines and that stuff on it. We have HPC for that.
Yeah, but the interesting problem is what happens next. Let's say a test case reveals that there's a flaw in the next-gen system. You fix it. It later turns out that the same flaw exists in the legacy system. What do you do?
Do you revert the fix, or leave it in place?
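That kind of shared flaw usually surfaces from running both systems side by side on the same inputs and diffing the results. A minimal sketch of such a harness, with the two system functions standing in as hypothetical callables:

```python
def diff_systems(inputs, legacy_fn, nextgen_fn):
    """Feed identical inputs to both systems and collect every divergence.

    Each mismatch records the input plus both outputs, so you can decide
    case by case whether to fix the legacy behavior or replicate it.
    """
    mismatches = []
    for item in inputs:
        old, new = legacy_fn(item), nextgen_fn(item)
        if old != new:
            mismatches.append((item, old, new))
    return mismatches
```

The awkward part is exactly the one raised above: when the diff list is empty because both systems share the same bug, the harness tells you nothing, and fixing one side suddenly makes the "flaw" show up as a divergence.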
They did NOT appreciate hearing that they had been running a bugged query for years...
At least there never is for me.