I think dup2 is the hint, but in the example case the dup2 path is never invoked: it's conditioned on passing an argument, and the test runs are just `./a.out`. IIUC, the issue is growing the file descriptor table. The dup2 is a workaround that preallocates a larger table (666 > 256 * 2)[1], avoiding the pathological case where a multi-threaded process has to grow the table. From the linked infosec.exchange discussion, it seems the RCU-based approach Linux uses can incur significant latency on that path, making simple cases like this much slower than they would be with a plain mutex[2].
[1] Off-by-one. To be more precise, since descriptors are numbered from 0, the state established by the dup2 is (667 > 256 * 2), or rather (667 > 3 + 256 * 2) once stdin, stdout, and stderr are counted.
[2] Presumably what OpenBSD is using. I'd be surprised if they've already imported and adopted FreeBSD's approach mentioned in the linked discussion, notwithstanding that OpenBSD has been on an MP scalability tear the past few years.
Now time to read the actual linked discussion.
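For reference, a minimal C sketch of the preallocation trick as I understand it. This is not the original test program: `fds_needed` and `reserve_fd_table` are hypothetical names, and I'm assuming 256 * 2 means two threads holding 256 descriptors each on top of stdin, stdout, and stderr.

```c
#include <assert.h>
#include <unistd.h>

/* Descriptors the process would hold, assuming stdin/stdout/stderr plus
 * nthreads * per_thread more (my reading of 3 + 256 * 2). */
int fds_needed(int nthreads, int per_thread)
{
    return 3 + nthreads * per_thread;
}

/* Hypothetical helper, not taken from the original program: duplicate
 * src_fd onto high_fd so the descriptor table must hold at least
 * high_fd + 1 slots. The duplicate is deliberately left open.
 * Returns 0 on success, -1 on error (errno is set by dup2). */
int reserve_fd_table(int src_fd, int high_fd)
{
    assert(src_fd >= 0 && high_fd >= 0);
    return dup2(src_fd, high_fd) == -1 ? -1 : 0;
}
```

Calling `reserve_fd_table(STDIN_FILENO, 666)` at the top of `main`, before any threads are started, should leave at least 667 slots against the 515 that `fds_needed(2, 256)` asks for, so the table would not need to grow while the threads are running. How the kernel actually sizes the table beyond that minimum is an implementation detail I haven't checked.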