Like a lot of the comments on LWN, I'm having a hard time understanding why this should go in the kernel, instead of just being a simple userspace library. There doesn't even seem to be a performance argument for it, since any userspace library can use iovecs to combine headers and messages to avoid multiple system calls.
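The iovec point above can be sketched in a few lines. This is an illustrative example (the header format is made up), showing that `sendmsg()` accepts a list of buffers and puts them on the wire with a single system call:

```python
import socket

# Sketch: combining a header and a payload with scatter/gather I/O,
# instead of issuing two separate send() calls. The 2-byte length
# prefix here is just an example framing, not any particular protocol.
a, b = socket.socketpair()
header = b"\x00\x07"
payload = b"message"

# sendmsg() takes a list of buffers (an iovec); the kernel
# concatenates them in one syscall.
a.sendmsg([header, payload])

data = b.recv(64)
print(data)  # b'\x00\x07message'
a.close()
b.close()
```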
At a glance, it seems like delivering a message to one of a pool of waiting processes would be difficult to do in a library. If a process reads a large swath of data it may get multiple messages plus a fragment; if it reads only part of a message, another process might steal the next few bytes. You'd either end up with a reader lock and small reads so as not to read past the first message, or introduce a demuxing process and double your work.
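The reader-lock approach described above can be sketched like this, using threads on a shared socket as a stand-in for a pool of processes; the length-prefixed framing and all function names here are illustrative assumptions, not from any real library:

```python
import socket
import struct
import threading

# With several readers sharing one stream socket, a reader must
# consume exactly one length-prefixed message at a time, or a peer
# steals bytes mid-message.

recv_lock = threading.Lock()

def recv_exact(sock, n):
    """Loop until exactly n bytes are read (recv may return less)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_message(sock):
    # The lock must span the whole prefix+body read: releasing it
    # between the two reads would let another reader grab the body.
    with recv_lock:
        (length,) = struct.unpack("!I", recv_exact(sock, 4))
        return recv_exact(sock, length)

def send_message(sock, payload):
    sock.sendall(struct.pack("!I", len(payload)) + payload)

a, b = socket.socketpair()
send_message(a, b"hello")
send_message(a, b"world")
print(recv_message(b), recv_message(b))  # b'hello' b'world'
```

Note the cost this implies: every reader serializes on the lock and issues two small reads per message, which is exactly the overhead the in-kernel approach avoids.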
In the single-process case there is also a potential performance win: the kernel need not wake a process to deliver a partial message, only to have it block again awaiting the rest.
That sounds way over-engineered: custom protocol-aware BPF programs... why???
Why not just use the TCP "push" (PSH) flag, set exactly once per application-level message on the sender side, and on the receiver side buffer in-kernel until "push" is encountered?