It's much more than just performance they've thought about. Here are some of the secure programming practices that have been implemented:
/* All the functions in this file are considered "secure", specifically:
- Constant time in the input, i.e. the input can be a secret[2]
- Small and auditable code base, incl. simple types
- Either, no local variables = no need to clear them before exit (most functions)
- Or, only static allocation + clear local variable before exit (fd_ed25519_scalar_mul_base_const_time)
- Clear registers via FD_FN_SENSITIVE[3]
- C safety
*/
libsodium[4] implements similar mechanisms, and Linux kernel encryption code does too (example: use of kfree_sensitive)[5]. However, firedancer appears to better avoid moving secrets outside of CPU registers, and [3] explains that libraries such as libsodium have inadequate zeroisation, something which firedancer claims to improve upon.
These are table stakes for core cryptographic code, and SOT crypto code --- like the Amazon implementation this story is about --- tend at this point all to be derived from formal methods.
As an example, the Amazon implementation doesn't refer to gcc's[1] and clang's[2] "zero_call_used_regs" to zeroise CPU registers upon return or exception of functions working on crypto secrets. OpenSSL doesn't either.[3] firedancer _does_ use "zero_call_used_regs" to allow gcc/clang to zeroise used CPU registers.[9]
As another example, the Amazon implementation also doesn't refer to gcc's "strub" attribute which zeroises the function's stack upon return or exception of functions working on crypto secrets.[4][5] OpenSSL doesn't either.[3] firedancer _does_ use the "strub" attribute to allow gcc to zeroise the function's stack.[9]
Is there a performance impact? [6] has the overhead at 0% for X25519 for implementing CPU register and stack zeroisation. Compiling the Linux kernel with "CONFIG_ZERO_CALL_USED_REGS=1" for x64_64 (impacting all kernel functions) was found to result in a 1-1.5% performance penalty.[7][8]
Zeroizing a register seems pretty straightforward. Zeroizing any cache that it may have touched seems a lot more complex. I guess that's why they work so hard to keep everything in registers. Lucky for them we aren't in the x86 era anymore and there are a useful number of registers. I'll need to read up on how they avoid context switches while their registers are loaded.