Actually, there's an even better solution: simply give the kernel a userspace address range which will be zeroed on fork. There's simply no reason at all why this address range should be restricted to exactly 4KB or a multiple thereof (one could imagine the kernel doing some page-table tricks to avoid a full memset() for large areas, but that's an optimization that can be added transparently to an API that supports arbitrary address ranges).
The futex API (including set_tid_address) is precedent for this kind of syscall.
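To make the semantics concrete, here is a rough userspace emulation of what I have in mind. It is only a sketch: zero_on_fork() is a hypothetical name, not an existing syscall, and it uses pthread_atfork() to wipe the range in the child, whereas the proposal is for the kernel to do this. The point is just that the registered range is an arbitrary (address, length) pair with no page alignment involved.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical registration record: one range, byte granularity.
 * A kernel implementation would keep this per process instead. */
static void *wipe_addr;
static size_t wipe_len;

/* Child-side atfork handler: zero the registered range after fork(). */
static void wipe_in_child(void)
{
    if (wipe_addr != NULL)
        memset(wipe_addr, 0, wipe_len);
}

/* Hypothetical stand-in for the proposed call: any address, any length. */
static int zero_on_fork(void *addr, size_t len)
{
    static int installed;

    if (!installed) {
        if (pthread_atfork(NULL, NULL, wipe_in_child) != 0)
            return -1;
        installed = 1;
    }
    wipe_addr = addr;
    wipe_len = len;
    return 0;
}

/* Returns 1 if the child saw the range zeroed while the parent kept
 * its data, 0 if not, and -1 on error. */
static int demo_zero_on_fork(void)
{
    static unsigned char secret[24];   /* deliberately not page sized */
    int status, child_ok, parent_ok;
    pid_t pid;

    memset(secret, 0xAA, sizeof secret);
    if (zero_on_fork(secret, sizeof secret) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: exit 0 only if every byte of the range is zero. */
        int dirty = 0;
        for (size_t i = 0; i < sizeof secret; i++)
            dirty |= secret[i];
        _exit(dirty ? 1 : 0);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    parent_ok = secret[0] == 0xAA && secret[sizeof secret - 1] == 0xAA;
    return child_ok && parent_ok;
}
```

Build with -pthread on older glibc. Note that the atfork handler only runs when the process goes through libc's fork(); a raw clone() call bypasses it, which is one reason to want this in the kernel rather than in userspace.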