
Go-benchmark: Golang benchmarks used for optimizing code - pjf
https://github.com/cornelk/go-benchmark
======
throwaway2016a
I personally wouldn't do any of these for an average project. They seem like
micro-optimizations. But this is missing the largest Go optimizations I can
think of, and the ones I would actually do: string vs []byte, and memory
allocation.

A lot of Go programs spend a lot of time converting between []byte and string.
The conversion is actually really slow, and keeping your string data as byte
slices is much faster; since many Go libraries can work with either, the
conversion is often not necessary.

Also, memory allocations and garbage collection are not free. If you have a
hot path that gets hit every request or in a loop, it can save a lot of time
reusing the same slices assuming you can do so without introducing bugs and/or
multi-threading issues.

~~~
notheguyouthink
Dumb question, but can you give an example of using []byte over string?

I've got a project where strings aren't exactly crucial, but we're storing a
ton of them in memory. We store them as string and, frankly, rarely access
most of them - though we do do some compares of strings on the hot path.

Using []byte sounds interesting, but troublesome at the same time. Namely, the
fact that a []byte is awkward for reading characters from, as you'd eventually
have to decode runes from it anyway, no?

Thoughts?

~~~
nemo1618
[]byte is crucial if you need to frequently modify the contents. The classic
example is building a string with a loop that appends a new character on each
iteration. Using += to concatenate strings inside such a loop will be slow;
the Right Thing To Do is to preallocate a []byte with sufficient capacity,
append to it in a loop, and then (if necessary) convert to a string only at
the end.
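
A sketch of that pattern, with a strings.Builder variant for comparison
(illustrative code, not from the repo):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBytes preallocates a []byte with enough capacity and converts to a
// string exactly once, at the end.
func buildBytes(n int) string {
	buf := make([]byte, 0, n) // one allocation up front
	for i := 0; i < n; i++ {
		buf = append(buf, 'a'+byte(i%26))
	}
	return string(buf) // single conversion at the end
}

// buildBuilder does the same with strings.Builder, which can also avoid
// the final copy.
func buildBuilder(n int) string {
	var b strings.Builder
	b.Grow(n) // preallocate, analogous to make([]byte, 0, n)
	for i := 0; i < n; i++ {
		b.WriteByte('a' + byte(i%26))
	}
	return b.String()
}

func main() {
	fmt.Println(buildBytes(5))   // abcde
	fmt.Println(buildBuilder(5)) // abcde
}
```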

~~~
coldtea
Doesn't Go have a mutable char/rune array?

~~~
jerf
Yes, it's just []rune, which has just a bit of special sauce in the runtime
that allows you to convert it directly to a string [1], and some functions
that convert UTF-8 to and from it. However rune is a 32-bit number sufficient
for storing a Unicode code point. Generally you're dealing in UTF-8 and prefer
to just pass that through code. I think I've only ever used a []rune once,
when I was doing some very heavy-duty Unicode-aware code, and I still ended up
refactoring it to use UTF-8 directly in the end.

[1]:
[https://play.golang.org/p/5DEzw85J5Ob](https://play.golang.org/p/5DEzw85J5Ob)
Note this is a _conversion_ ; the string is UTF-8 and the runes are 32-bit
ints, so this creates a new string.
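
A small illustration of the difference between the two representations (not
taken from the linked playground snippet):

```go
package main

import "fmt"

// runeLen returns the byte length and the rune count of s.
func runeLen(s string) (bytes, runes int) {
	return len(s), len([]rune(s))
}

func main() {
	s := "héllo" // 'é' takes two bytes in UTF-8
	b, r := runeLen(s)
	fmt.Println(b, r) // 6 5

	rs := []rune(s) // the conversion decodes UTF-8 into 32-bit code points
	rs[1] = 'e'     // []rune is mutable in place
	fmt.Println(string(rs)) // hello (converting back allocates a new string)
}
```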

------
flgr
There's some context missing here. AFAIK the author of the repository was
specifically doing micro-benchmarks to inform the design of his
[https://github.com/cornelk/hashmap](https://github.com/cornelk/hashmap)
library. I think for that use case (designing a high-performance lock-free
data structure), that kind of micro-benchmarking makes a lot of sense.

That hash map looks good and I'm thinking we'd probably benefit from using it
on the hot path of some code we have that needs to be highly scalable.

(Both links were submitted to HN, but only this one seems to have landed on
the front page.)

------
sesteel
Somewhat related; I was building a game a few years ago and passing 4x4
matrices (128 bytes) by value was faster than pass by reference. I don't know
exactly how many bytes can be passed by value before it becomes slower, but
please don't read this and mindlessly assume pass by value is slower.
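
This is easy to try with Go's own benchmark machinery. Below is a sketch (the
mat4 type and functions are made up for illustration), run via
testing.Benchmark so it works as a plain program; actual results depend
heavily on compiler version, inlining, and hardware:

```go
package main

import (
	"fmt"
	"testing"
)

type mat4 [16]float64 // 128 bytes, like a 4x4 matrix of float64

// go:noinline directives keep the compiler from inlining away the calls.

//go:noinline
func addValue(a, b mat4) mat4 {
	for i := range a {
		a[i] += b[i]
	}
	return a
}

//go:noinline
func addPointer(dst, a, b *mat4) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	var a, b, dst mat4
	byValue := testing.Benchmark(func(bm *testing.B) {
		for i := 0; i < bm.N; i++ {
			dst = addValue(a, b)
		}
	})
	byPointer := testing.Benchmark(func(bm *testing.B) {
		for i := 0; i < bm.N; i++ {
			addPointer(&dst, &a, &b)
		}
	})
	fmt.Println("by value:  ", byValue)
	fmt.Println("by pointer:", byPointer)
}
```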

~~~
Cthulhu_
Also, take GC pressure into consideration; pass by value is 'free' in that it
does not put data onto the heap and thus does not cause any GC overhead.

------
friday99
This is not a very good example of benchmarking. First, all of the benchmarks
need to be run multiple times (e.g. -count=5 at least) to get reliable
results. Second, benchmarking like this is really meant for analyzing a single
operation; it confuses the analysis when a repetition of 1024 operations is
benchmarked, so that the numbers have to be divided by 1024 to get comparable
figures. The Go benchmarking library can reliably measure operations only a
few instructions long, so there is no need to repeat them N times per
measurement. Sometimes you will want to see how an operation changes with
input length, e.g. hashing 8 versus 128 characters, but in those cases the
operation is "hashing" and you should have separate benchmarks to measure
performance at the different lengths, which is a great use case for sub-
benchmarks via b.Run(). Finally, as others have mentioned, most of these are
micro-optimizations that make your code less maintainable; e.g. you should
always iterate over a slice with range unless you have some logical reason not
to.

------
leaveyou
>>Writing and reading an int atomic

>>BenchmarkAtomicInt64-8 5000 354907 ns/op

That ns/op looks too big to be for a single int64 operation...

~~~
dis-sys
Almost all of those ns/op numbers were measured by repeating the operation
BenchMarkSize or BenchMarkSizeLong times per benchmark iteration.

------
mickronome
Each op must be around 1000 iterations, because that's roughly how far off the
ns/op numbers seem to be.

~~~
shabbyrobe
Yep! 1024. Good eye! Some tests, like the atomic ones, use an even larger
constant (which explains the crazy 355971 nsec/op result for an atomic
increment):

    BenchMarkSize     = 1 << 10 // 1024
    BenchMarkSizeLong = BenchMarkSize << 5 // 32768

~~~
jerf
Yes, that one really caught my eye, because if an atomic int increment took
355 microseconds, I have some code that would have exploded a long time ago.

I'm not sure what the idea is there, the benchmark system can accommodate
small operations, I've used it before on things like single increments or
single interface calls and gotten reasonable answers.

Also benchmarking atomic int increment without any contention is not
necessarily "useless", but certainly not a full picture if one is
investigating using it in a contended data structure. (My usage of it is
mostly just counters, where I expect thousands upon thousands of instructions
to be between the increment of any given one of them, meaning there's probably
no contention to speak of even on production multi-core servers, so I have no
idea how they perform under serious contention. But AFAIK it's just the
hardware-supported atomic instructions, so what I don't know about is the
hardware, not really Go per se.)

~~~
jasonwatkinspdx
> I'm not sure what the idea is there, the benchmark system can accommodate
> small operations, I've used it before on things like single increments or
> single interface calls and gotten reasonable answers.

Yeah, I've been wondering if I'm missing something, because adding a second,
redundant loop inside the b.N loop is common in Go repos, but it seems
pointless and needlessly obscuring.

------
AYBABTME
These results should be taken with a big grain of salt. A lot of what is true
of this code will not be true at all of YOUR code. This benchmark code, for
instance, is trivial, and many parts will get optimized away by the compiler.
That likely won't happen in your more complex code.

~~~
HankB99
That can be a real pitfall for benchmarking. I recall looking at some
promotional material touting the performance advantages of Rust over some
other language (C?). It involved a loop, so I kept increasing the loop count
to try to get the Rust execution to take the same time as the other language.
It took me a while, but I finally figured out that, the way the code was
structured, the Rust compiler could see that the results of the computation
were never used, so the loop was simply optimized away.

Benchmarks can measure the execution time of algorithms or, if you're not
careful, the ability of the compiler to optimize away unnecessary code.

(IMO the 'performance comparison' that was vastly in favor of Rust was a bit
disingenuous.)

------
cooervo
So we should avoid defer?

~~~
arendtio
I was wondering too. Does anybody know what makes defer so expensive?

~~~
dis-sys
The 5300+ ns/op is for BenchMarkSize (1024) defer operations, so we are
actually paying about 5ns/op for defer. Not sure why the author chose to
report the ns/op number in its current way.

~~~
jerf
"so we are actually paying about 5ns/op for defer"

Yeah, it's in the class of "things you should know about when optimizing tight
loops", but not "things you should always be worrying about"; remember 5ns is
on the order of a branch mispredict or an L2 cache miss [1]. If you haven't
already squeezed out all the main memory accesses you can from your algorithm
at 100ns a pop, or if you're in code dealing with networks (routinely _micro_
seconds, even within datacenters; _milli_ seconds if you have to leave), or
files, or anything else like that, micro-optimizing defers is not going to
have any visible effect on your speed, but it can badly damage your code's
correctness and its ease of writing and modification.

The other thing to know about defers is that they are not declarations, they
are instructions; every time the program counter moves past one, another defer
call is added to the stack of calls to make while exiting the function. So you
do have to be careful about deferring in a loop, not so much because it's
"dangerous" but just because it's easy to be unaware that the defers are not
declarations. If you intend to do it, it's fine, it can just bite you if you
do it accidentally.

[1]:
[https://gist.github.com/jboner/2841832](https://gist.github.com/jboner/2841832)
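
A small demonstration of the point above: deferred calls in a loop pile up and
only run, in LIFO order, when the surrounding function returns (illustrative
code):

```go
package main

import "fmt"

// order records the sequence of events so the LIFO behaviour is visible.
func order() []string {
	var events []string
	func() {
		for i := 0; i < 3; i++ {
			// Each pass of the loop pushes another deferred call;
			// none of them run until this anonymous function returns.
			defer func(i int) {
				events = append(events, fmt.Sprintf("deferred %d", i))
			}(i)
		}
		events = append(events, "loop finished")
	}()
	return events
}

func main() {
	for _, e := range order() {
		fmt.Println(e)
	}
	// Prints "loop finished" first, then "deferred 2", "deferred 1",
	// "deferred 0".
}
```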

------
lmb
That repo encourages mindless micro-optimization. Don't waste your time
worrying about how to iterate over a slice; most of the time it makes no
difference.

~~~
wmsiler
That repo doesn't encourage anything. It's presenting data that the author
found useful for another project they were working on, which presumably did
benefit from micro-optimization. You are right that micro-optimization is
usually not needed, but the repo and its author never claimed otherwise.

------
__aeneas
This should also show that append is really, really heavy when the slice is
not pre-allocated, and that growing a slice that way should be avoided
whenever possible!
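
A quick way to see the reallocations is to count how often a slice's capacity
changes while appending (illustrative sketch; the exact number of growth steps
depends on the Go version's growth policy):

```go
package main

import "fmt"

// capChanges appends n elements to buf and counts how many times the
// backing array had to be reallocated (observed as a capacity change).
func capChanges(buf []int, n int) int {
	changes := 0
	last := cap(buf)
	for i := 0; i < n; i++ {
		buf = append(buf, i)
		if cap(buf) != last {
			changes++
			last = cap(buf)
		}
	}
	return changes
}

func main() {
	// Growing from nil: append reallocates and copies repeatedly.
	fmt.Println("growing from nil:     ", capChanges(nil, 1000))
	// Preallocated: the backing array never moves.
	fmt.Println("preallocated cap 1000:", capChanges(make([]int, 0, 1000), 1000)) // 0
}
```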

------
majewsky
Sure, "premature optimization is the root of all evil" and so on. But what I
find much worse is that hashing algorithms are put side by side without any
context, making it look like MD5 is better than SHA-2. That's downright
irresponsible.

