My advice for the author and anybody else considering this is to break out the confirmed-correct assembler code into its own non-Go object and then link it in; otherwise, you're depending on the compiler to never change and inadvertently introduce branches. Since the functionality of the code wouldn't change, it would be difficult to check with a unit test. (I guess you could add one that did the assembly step and then grepped for jump instructions, but that has problems of its own.)
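For the curious, a rough sketch of that grep-style test in Go, with all the fragility just mentioned (the symbol name ConstantTimeEq is hypothetical, and the output parsing assumes the usual `go tool objdump` column layout):

    package timing

    import (
        "os"
        "os/exec"
        "strings"
        "testing"
    )

    // TestNoConditionalJumps disassembles this test binary and fails if the
    // function under test contains any conditional jump mnemonics.
    func TestNoConditionalJumps(t *testing.T) {
        exe, err := os.Executable() // the test binary includes the function under test
        if err != nil {
            t.Fatal(err)
        }
        out, err := exec.Command("go", "tool", "objdump", "-s", "ConstantTimeEq", exe).CombinedOutput()
        if err != nil {
            t.Fatalf("objdump failed: %v\n%s", err, out)
        }
        for _, line := range strings.Split(string(out), "\n") {
            for _, f := range strings.Fields(line) {
                // Conditional jumps on amd64 disassemble as JNE, JEQ, JLT, etc.,
                // while the unconditional JMP is fine.
                if strings.HasPrefix(f, "J") && f != "JMP" {
                    t.Errorf("conditional jump in output line: %s", line)
                }
            }
        }
    }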
Author here. Thanks for the advice! That's a very good point.
Maybe it's worth doing a follow-up on this? I originally thought of doing unit tests to ensure it's actually running in constant time for different inputs, but I'm not sure they would run well on anything that wasn't bare-metal (probably too inconsistent).
Have you done this with Go or any other compiled language? I remember a colleague of mine was looking into C++ assembly generation rules a few years ago, but I haven't heard how far he got.
So I looked into this and Go doesn't exactly make it easy.
The process would be to break out the hand-reviewed (but not necessarily hand-written) assembler for this function into a separate source file for each supported architecture (probably x86-64 and ARMv? nowadays), compile from assembler source into an object file, then link the final Go binary together using the object file and the rest of your app's Go source.
However the Go toolchain doesn't make this particularly easy. I think it might be necessary to use the cgo package[0] and import these objects as C-style functions, even if they originally came from Go source. (If somebody has a simpler way, I'd love to know!)
I think calling conventions for pure Go functions differ somewhat from the C ABI, so you might have to tweak the assembly code to follow C calling conventions as well—with the caveat that I haven't dived deeply into this, and could be all wet.
This method would be "fragile" in the sense that toolchain changes would risk breakage with every Go revision, but at least it would be your build that breaks and not your code's timing-attack security.
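For what it's worth, here's a rough sketch of what the cgo route might look like, with entirely hypothetical names (constant_time_eq, libeq.a): the reviewed assembly is built into a static library with an external assembler, given a C prototype, and called through cgo.

    package timing

    // #cgo LDFLAGS: -L${SRCDIR} -leq
    // int constant_time_eq(const unsigned char *a, const unsigned char *b, int n);
    import "C"

    import "unsafe"

    // ConstantTimeEq wraps the externally assembled comparison routine.
    func ConstantTimeEq(a, b []byte) bool {
        if len(a) != len(b) {
            return false
        }
        if len(a) == 0 {
            return true
        }
        return C.constant_time_eq(
            (*C.uchar)(unsafe.Pointer(&a[0])),
            (*C.uchar)(unsafe.Pointer(&b[0])),
            C.int(len(a)),
        ) == 1
    }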
Usual disclaimer: the branch predictor is one of the things that will make your program slower, so keep it in mind, but it's the memory, stupid(!). If the powers that be want a faster program in a few days, focus on memory layout and cache usage first.
Yes, thank you. Improving cache efficiency is by far the biggest single thing you can do to increase performance. If the code and data for both outcomes of an 'if' are in L1 cache, that 'if' is never going to be slow.
That's still dwarfed by a cache miss if you're unlucky.
I just found a bug (slow code is a bug) in the D backend where bad data layout led to 32 MILLION LLC misses (85% of the whole program's misses) coming from one line!
Think about how much of a cache line you are using per iteration, folks.
The kind of data that is going to generate (enough) cache misses (to be a problem) behind your back is usually the stuff which you can't put on the stack.
Yup, and Go is particularly bad for this because it handles allocations automatically (and poorly). I can double a Go program's performance by going through the memory profile and rearranging the instructions to minimize hidden allocations.
The worst offender is slices, since you can't mark them read-only or stack-allocated.
People need to focus on solving the business problem first. "Memory layout" and "cache usage" should be abstracted away behind premade data structures and algorithms in a library. You could then use a "CacheFriendlyHashMap" class as a drop-in. Does such a library already exist for any language?
>You could then use "CacheFriendlyHashMap" class as a drop in
It's impossible to have a generic cache-friendly data structure because the optimal data structure depends on access patterns. E.g. if I have a collection of struct{x:int, y:int, z:int, w:int}, and I know the two main use cases are indexing by x and iterating over the collection performing some op on (y,z,w), then the ideal data structure is one where all the x are contiguous in memory ([x1, x2, x3...]) and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).
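A rough sketch of that split in Go (type and field names are made up for illustration):

    package soa

    // Items keeps the index field and the payload triples in separate
    // slices so each access pattern touches only the bytes it needs.
    type Items struct {
        X       []int // x1, x2, x3, ...             (lookups by x)
        Payload []int // y1, z1, w1, y2, z2, w2, ... (bulk iteration)
    }

    func (it *Items) Add(x, y, z, w int) {
        it.X = append(it.X, x)
        it.Payload = append(it.Payload, y, z, w)
    }

    // Sum walks the contiguous triples without ever pulling the x values
    // into cache.
    func (it *Items) Sum() int {
        total := 0
        for i := 0; i < len(it.Payload); i += 3 {
            total += it.Payload[i] + it.Payload[i+1] + it.Payload[i+2]
        }
        return total
    }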
> and separately the y,z,w triples are contiguous ([y1,z1,w1,y2,z2,w2,y3,z3,w3...]).
Wouldn't having the ys, zs, and ws occupy different cache lines be good enough? After all, the CPU only wants fast access to that data. Or maybe it's hard for CPUs to fetch from 4 different lines at a time (+1 for the instructions)?
I have a hard time imagining a programming language where memory layout of your data can be abstracted away like that. You need to think about which data to group together in memory and that depends on the access patterns. It's not enough to replace your binary trees with cache-friendlier B-trees if you still need to access half a dozen of them to perform one iteration of a loop.
A JIT for data, or an offline simulation pass for the compiler, could help. We'd probably have this already if it weren't for the fact that, unlike code, a JIT regime for memory can't be universally effective. So for the PLT part of it, maybe domain-constrained languages? That said, all of this would bring into userland the gymnastics the VM is already doing with paging, etc. I think the more interesting question is: what would an OS plus language families that convey semantics/intent to the OS look like?
- Go was designed with a simple compiler, because its purpose wasn't the best performance at all costs, but simple code and fast compilation.
- Instead we watch as people try to second-guess their processor and alter their code to optimize the machine code produced. Defeating the purpose of Go?
- I've actually seen cases where removing the branches slows down code. People have learned that branching is interesting to the CPU, and have decided that no branching means "faster". Actually branching is perfectly fast, when the predictor is right (most of the time). You only want to avoid branching where the predictor does poorly.
Anyway, I think you should only do these tricks in very hot spots of your code, backed by a solid benchmark to confirm the improvement.
The "branchful" version is not only slower for the processor, it's also slower for a human - this one, at least, would never write code like that; the verboseness and repetition just screams "you're doing it wrong". When I see such duplication, it slows me down because I have to inspect each case to determine that there's not one that's subtly different. I would at least use a loop.
Also, shifting right by n and picking off the least significant bit (&1) may save an instruction or two, depending on the processor.
Finally, an array of bools is itself intrinsically wasteful[1], as the processor can easily test whether a certain bit is set, or set and clear bits, with a single instruction if you keep them packed together into bytes. There's another comment here about memory layout and cache usage. (A rough sketch of both ideas follows the footnote.)
[1] It reminds me of the questions "what's the fastest way to generate/parse <bloated text format>?" in an application where you control both ends and a human almost never needs to see the data.
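Something like this minimal Go sketch covers both suggestions, packing into uint64 words purely for illustration:

    package bits

    // Bitset packs flags into 64-bit words instead of one bool per byte.
    type Bitset []uint64

    // Set turns on bit n.
    func (b Bitset) Set(n uint) { b[n/64] |= 1 << (n % 64) }

    // Clear turns off bit n.
    func (b Bitset) Clear(n uint) { b[n/64] &^= 1 << (n % 64) }

    // Test shifts right by n and keeps the least significant bit, so the
    // result is already 0 or 1 with no comparison needed.
    func (b Bitset) Test(n uint) uint64 { return (b[n/64] >> (n % 64)) & 1 }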
This must be some strange new definition of "branchless" with which I'm not previously familiar; the generated code is full of branches, conditional or otherwise. I don't think invoking some other tool to put branches into your code qualifies you as branchless.
Have you compared your parser to simdjson-go? I haven't looked specifically at the go rewrite, but I hear it's decent.
I was very excited to do branch-free coding to parse JSON, but really only handled the 'lexing' portion of the task in a branch-free fashion. Certainly more of the task could be done branch-free, especially if you are working with SIMD or GPGPU.
I apparently had a fundamental misunderstanding of the meaning of branchless. That's embarrassing.
As for simdjson-go, I did benchmark it. rjson outperformed simdjson-go in most benchmarks, but simdjson-go was about 3% faster reading citm_catalog.json.
I see on your bio that you are one of the simdjson authors. I hope you will indulge a question about it. I am generally more interested in parsing a large volume of JSON documents efficiently than I am in doing it quickly. Since simdjson uses parallel instructions, would that mean that the speedup comes at the expense of being able to process more documents in parallel on the same hardware?
I'm not sure I believe your benchmarks, honestly. They are way slower than the C++ version. But I don't really have a horse in the race as simdjson-go is its own thing. Still, I'm mildly surprised that a byte-by-byte parser is the same speed as a SIMD approach, and usually when someone tells me this, the answer is that they buggered up the benchmarking. I would suggest comparing your numbers to the simdjson paper or whatever brag numbers simdjson-go has as a reality check; one of us is clearly wrong.
The use of parallelism in simdjson is the use of SIMD instructions, not multiple cores. So the answer is "no" to the second question.
I love me some branchless coding tricks. It's fun. My favorite is how you can multiply a bool by integers and use masking to conditionally set variables without branching. Just make sure that's defined behavior in your language. Casting bools to ints and bit shifting them can also be used to conditionally bit pack without branching. Lots of fun.
(Might not be worth it if your architecture has conditional move instructions and your compiler is smart enough to use them.)
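As a rough Go sketch of the mask trick: Go won't let you multiply a bool directly, so the mask is built from a 0/1 int, and crypto/subtle does the same job in the standard library.

    package main

    import (
        "crypto/subtle"
        "fmt"
    )

    // selectByMask returns a if v == 1 and b if v == 0, without a branch:
    // mask is all ones when v == 1 and all zeros when v == 0.
    func selectByMask(v, a, b int) int {
        mask := -v
        return (a & mask) | (b &^ mask)
    }

    func main() {
        fmt.Println(selectByMask(1, 10, 20))              // 10
        fmt.Println(selectByMask(0, 10, 20))              // 20
        fmt.Println(subtle.ConstantTimeSelect(1, 10, 20)) // 10
    }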
Can anyone explain the “why does this matter” paragraph where the author seems to suggest that using branches in our program is a security risk?
I know that branch speculation can be used as an attack vector if our program is the aggressor - but does simply using branches in some way make us more likely to be the victim?
Say that your server has an admin login that simply compares the submitted password to "thisistherootpassword", one byte at a time. The attacker starts by trying "a","b","c", etc. and measuring the time it takes for the server to respond. When they get to "t", the server takes slightly longer, so they start trying "ta","tb","tc", etc. until the server takes even longer to respond. This allows the attacker to crack the password much, much faster than naively brute-forcing the space.
Actually, I have an even better example: I performed exactly this attack in order to get 1st place in a programming competition. You can read my writeup here: https://lukechampine.com/progcomp.html
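A sketch of the two comparison styles in Go (function names are made up; the fix is essentially what crypto/subtle already provides):

    package example

    import "crypto/subtle"

    // leakyEqual returns at the first mismatching byte, so its running
    // time grows with the length of the correctly guessed prefix.
    func leakyEqual(a, b []byte) bool {
        if len(a) != len(b) {
            return false
        }
        for i := range a {
            if a[i] != b[i] {
                return false
            }
        }
        return true
    }

    // constantTimeEqual touches every byte regardless of where the first
    // mismatch is.
    func constantTimeEqual(a, b []byte) bool {
        return subtle.ConstantTimeCompare(a, b) == 1
    }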
This is regarding a JWT[0], which is often used for authentication.
Server-side code which takes a different amount of time depending on what bits are set in the JWT (or any similar authentication token) can be probed by repeating the operation with different values. Think of lockpicking—if you can move a pin and hear a click or feel more or less resistance, you know you've poked something critical in the core.
One classical example of this is a side-channel attack on RSA. In brief, the central computation is to raise a number to the power of a secret key. This is naively accomplished with a square-and-multiply algorithm, where an extra multiplication happens for every 1 bit in the binary representation of the secret key. By watching the power consumption of a device as it encodes/decodes a message, you can read off the secret key with relative ease.
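A minimal (and deliberately unhardened) square-and-multiply sketch in Go, just to show where the data-dependent work comes from:

    package rsaexample

    import "math/big"

    // modExp computes base^exp mod m. The squaring happens every iteration,
    // but the extra multiply only runs when the current key bit is 1, which
    // is exactly the difference a power or timing side channel can observe.
    func modExp(base, exp, m *big.Int) *big.Int {
        result := big.NewInt(1)
        b := new(big.Int).Mod(base, m)
        for i := exp.BitLen() - 1; i >= 0; i-- {
            result.Mul(result, result).Mod(result, m) // always square
            if exp.Bit(i) == 1 {                      // multiply only on 1 bits
                result.Mul(result, b).Mod(result, m)
            }
        }
        return result
    }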
In that case, the code would probably work like the C code example I gave. So... I guess you're right. It _can_ be done in Go without `if` if we use unsafe. I can't argue with that =)
In general, I try to avoid using unsafe as much as possible. Not because I worry that it's "unsafe" or prone to errors (I'm experienced in C, C++, and assembly), but because it makes the code harder to read, and I don't think my coworkers would appreciate that. ;)
Clang generates slightly better code than Go, I think, because it fuses the mov and or together into an lea (demoting or to add in the process): https://godbolt.org/z/1Gz5KzrzK
I'm unsure whether there is a perf difference, though, as register-to-register mov may get boiled away to nothing in register renaming.
Does Go support "__builtin_expect" similar to gcc?
I optimized a tight loop of network packet forwarding code with "__builtin_expect" and "inline" to get rid of almost all the branch delays and was able to get a >10X increase in performance.
If Go does have some kind of "__builtin_expect" equivalent, I can't find it.
I'd say it probably doesn't exist. Even if it _did_ exist, the Go compiler authors would reserve it for internal use only, like they do with most other Go compiler directives.
Instead of a "minimum wait", could you implement a random wait? Some random number of ns/ms between calls. Something that's enough to make any timing attack measurements unusable?
Random waits only increase the number of measurements an attacker needs to make, they do not eliminate the side-channel leakage. Intuitively the attacker can take many measurements and average them out to eliminate the randomness introduced by the waits, then extract the bits of information from the side channel.
A random wait still leaks, but if you can guarantee a reasonable upper bound on the time required, always waiting until that upper bound has passed before responding does not leak.
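A small sketch of the fixed-upper-bound idea in Go (the budget value is made up):

    package main

    import (
        "fmt"
        "time"
    )

    // respondConstantTime runs work, then sleeps until a fixed budget has
    // elapsed since the start, so the total response time no longer depends
    // on how long the work took. time.Sleep returns immediately for negative
    // durations, so overruns of the budget are not hidden.
    func respondConstantTime(budget time.Duration, work func()) {
        start := time.Now()
        work()
        time.Sleep(budget - time.Since(start))
    }

    func main() {
        respondConstantTime(50*time.Millisecond, func() {
            // ... check the password, verify the token, etc. ...
        })
        fmt.Println("responded")
    }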
Here's a secret number plus or minus 10: 98, 104, 101, 110, 93...
You'll never guess what the secret number is. /s
The entire purpose of statistics is to uncover that secret number, and it works if you have enough data. Changing the random plus or minus won't save you.
The time it takes for your function to respond is now a continuous random variable X + r, where X is the random wait and r is the real time your function takes to complete.
E(X + r) = E(X) + E(r) = E(X) + r
E(X + r) is estimated by the sample mean, which is easy to calculate, and for a uniform wait E(X) can be estimated as
E(X) ≈ 1/2 * (max(sample) - min(sample))
so the attacker recovers r as the sample mean minus that estimate.
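As a toy illustration (all numbers invented), averaging enough samples recovers r even with a uniform random wait added:

    package main

    import (
        "fmt"
        "math/rand"
    )

    func main() {
        const (
            r         = 250.0 // the "secret" real duration (microseconds)
            maxJitter = 100.0 // uniform random wait added to each response
            samples   = 100000
        )

        sum, min, max := 0.0, 1e18, 0.0
        for i := 0; i < samples; i++ {
            t := r + rand.Float64()*maxJitter
            sum += t
            if t < min {
                min = t
            }
            if t > max {
                max = t
            }
        }

        mean := sum / samples
        jitterMean := (max - min) / 2 // estimate of E(X) for a uniform wait
        fmt.Printf("estimated r: %.1f (true value %.1f)\n", mean-jitterMean, r)
    }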
I never got round to trying it, but I was wondering whether you could generate a Cauchy-distributed wait as a way of breaking badly written benchmarks/timing gadgets (i.e. the mean is undefined, so averaging the samples never converges).