The timeout argument does not work as intended. The timeout is
checked only after the receipt of each datagram, so that if up to
vlen-1 datagrams are received before the timeout expires, but then no
further datagrams are received, the call will block forever.
I've done similar tests with UDP applications. It's possible to get 500K pps on a multi-core system with a test application that isn't overly complex and doesn't resort to many tricks. The problem is that the system spends 80% to 90% of its time in the kernel doing I/O, which leaves you almost no time to run your actual application.

Another alternative is pcap and PF_RING, as seen here: https://github.com/robertdavidgraham/robdns
That might be useful. Previous discussion on robdns: https://news.ycombinator.com/item?id=8802425
The point of all that work to context-switch into processes to handle small amounts of network I/O is that very often THE CORRECT SOFTWARE ARCHITECTURE is for multiple address-space-separated processes to be doing small amounts of network I/O. That I/O "means something" to a larger data model being implemented by the software.
It's true that for some tasks that "look like routing" there's no point to having that kind of external data model. The packets are the data being operated on. So there's little value in process separation and you might as well DMA them all streamwise into a single process to do it. And that's great stuff, but AFAICT it's really not what the linked article is about.
Ultimately, all those packets are going to end up in conventional processes, because that's where conventional processing needs to happen. There are very good reasons why we like our page-protected address space separation in this world!
Apparently you can also use the SO_RCVTIMEO socket option, which specifies a timeout that applies to all receive operations on the socket.