It's hard to beat SimdJSON + io_uring. Other implementations would need a port of the former (or an equivalent or faster parser) and would also have to rely on io_uring, with epoll being a viable competitor only in a limited set of scenarios. I would also expect Python to be a bottleneck here, with compiled languages other than C/C++ that have good interop and the ability to write vectorized code, such as Rust, C# or Go, having the upper hand.
You are right! For the convenience of Python users, we have to introspect the messages and parse JSON into Python objects, with every member of every dictionary being allocated on the heap.
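To give a rough feel for that allocation cost, here is a minimal sketch using only the standard library. It is not our binding code: `json.loads` stands in for whatever parser sits underneath, and the helper name `allocated_bytes_for_parse` and both payloads are made up for illustration.

```python
import json
import tracemalloc


def allocated_bytes_for_parse(payload: str) -> int:
    """Return how many bytes the Python heap grew by while parsing `payload`.

    Hypothetical helper for illustration; json.loads is a stand-in parser.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    obj = json.loads(payload)  # builds dicts, lists, strs and floats on the heap
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del obj  # the parsed object is kept alive until after the measurement
    return after - before


# Made-up payloads: a tiny message and one with a thousand nested dictionaries.
small = json.dumps({"id": 1, "method": "sum", "params": {"a": 2, "b": 3}})
large = json.dumps({"items": [{"k": str(i), "v": float(i)} for i in range(1000)]})

if __name__ == "__main__":
    print("small payload:", allocated_bytes_for_parse(small), "bytes")
    print("large payload:", allocated_bytes_for_parse(large), "bytes")
```

The exact numbers depend on the interpreter version, but the growth with the number of dictionary members is the point: that work happens after the parser is done, no matter how fast the parser itself is.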
To make it as fast as possible, we don't use PyBind, NanoBind, SWIG, or any other high-level tooling. Our Python bindings are a pure CPython integration. There is just no way to beat that combo, at least not that I know of.
How large? Also, I'm not sure the gRPC C++ server implementations you've tested are the fastest. If you're comparing to FastAPI (which is more of an HTTP server framework), then you should also compare to what is at the top of https://www.techempower.com/benchmarks/#section=data-r21.
1 - https://github.com/squeaky-pl/japronto