
Go Performance Tales - signa11
http://jmoiron.net/blog/go-performance-tales/
======
jws
I found the bit about Go using native AES instructions to accelerate map key
hashing most interesting. It accounted for a >50% throughput increase on the
AWS c1.xlarge instances that happened to have AES-NI compared to those that
didn't.¹

This is the kind of detail most developers would not be aware of, and to be
fair, even now that I know it exists, the only reference I can google up at
golang.org is the C source of runtime/alg.c, where you will see…

    
    
        23		if(use_aeshash) {
        24			runtime·aeshash(h, s, a);
        25			return;
        26		}
    

… no hint that it might reduce your hosting costs by 33% or account for some
huge variation in performance between one test machine and the next, or even
individual runs if you are spinning up cloud instances to do your testing.


¹ Does your CPU have the AES, SSE3 and SSE4.1 capability bits all turned on?
If so, you will hash mightily! Do you even know where to look to check?

~~~
ominous_prime
I had a similar realization recently, where I would occasionally get an ec2
instance that would cause my server to core dump.

Turns out that a library I was using set `gcc -march=native` by default, which
was emitting some instructions that weren't valid everywhere. I was requesting
the same virtual hardware, but the physical variations caught me off guard.

------
DrJokepu
I find it interesting how insightful, technical articles like this receive
hardly any comments while the usual "softer" articles that tend to dominate
the Hacker News frontpage these days receive dozens if not hundreds of
comments. I wonder what this says about us, HN readers.

~~~
munificent
There's a difference between comments and upvotes. With articles like this
(which I really liked), I certainly upvoted it but _because_ it's so
technically detailed, I don't have anything useful to add.

~~~
canadev
Same here. I really enjoyed reading it, but I don't understand a lot of what
it is talking about. I also bookmarked it. When the post had 0 comments I was
tempted to post something to the effect of what I just said, but I figured it
didn't really add any value.

------
jsnell
Just a note on the zlib optimization patches. The blog post is linking to an
old version, there's a newer one from a month ago. Also, the patch still
appears to be a bit buggy (e.g. corrupt output being generated by the new
deflate strategy), so don't plan on actually deploying it.

------
SixSigma
If you want more details on Go profiling, this post on the Go blog is a great
place to look

[http://blog.golang.org/profiling-go-programs](http://blog.golang.org/profiling-go-programs)

------
stcredzero
I've been looking into _false sharing_ and other pitfalls of parallelism on
multi-core systems.

[http://en.wikipedia.org/wiki/False_sharing](http://en.wikipedia.org/wiki/False_sharing)

It seems like our multi-core hardware isn't that well thought out yet. In
particular, _locks_ can cause false sharing. Even Go channels are based on
locks!

I would love to see some kind of annotation in golang for padding certain
structs to cache-line boundaries. The cache-line size can be read from most
processors, so it could be done in a cross-platform fashion. The gringo
ring-channel library has to do its own padding to avoid false sharing.

[https://github.com/textnode/gringo/blob/master/gringo.go](https://github.com/textnode/gringo/blob/master/gringo.go)

    
    
        type Gringo struct {
            padding1 [8]uint64
            lastCommittedIndex uint64
            padding2 [8]uint64
            nextFreeIndex uint64
            padding3 [8]uint64
            readerIndex uint64
            padding4 [8]uint64
            contents [queueSize]Payload
            padding5 [8]uint64
        }
    

There was a similar technique for padding Java objects, but it now turns out
that the JVM is smart enough to optimize the padding out!

~~~
jamesmiller5
Interesting point. I doubt Go would let something as low-level as struct
padding bubble up into the language, though. Perhaps there could be some kind
of library-level "Unsafe" field annotation, though I'd prefer the compiler to
optimize this automatically.

~~~
stcredzero
My first thought is that it's going to be pretty challenging for a compiler to
optimize this without runtime information.

------
yukichan
The biggest performance issues I think some people run into with Go involve
reflection, which seems to be slow. Something that does a lot of JSON parsing,
for example, could be much slower in Go than in Java, Python or JavaScript. I
don't have any data, but I've known people who work with Go to complain about
it. I wonder if a JIT or AOT compiler might help.

~~~
schmichael
There are libraries that generate specialized marshallers to fix precisely
this issue:
[https://journal.paul.querna.org/articles/2014/03/31/ffjson-faster-json-in-go/](https://journal.paul.querna.org/articles/2014/03/31/ffjson-faster-json-in-go/)

~~~
joncooper
That is an interesting project but won't yet help to speed up the (very slow)
JSON unmarshalling.

~~~
schmichael
Indeed, that's on the TODO list at the bottom.

------
timtadh
I have also played around with trying to build a high-performance trie in
Go.[1] My approach uses the ternary search trie structure. Unfortunately, I
have not yet matched the performance of either the built-in map or a hash
table written in Go (although Sedgewick tells us you should be able to beat a
hash table). My TST does not yet have the Patricia trie optimization
(collapsing long internal runs); perhaps with that addition it will get closer
to hash table performance.

Also everything he said about channels also holds true in my experience. I
haven't tried writing a C library for Go yet but his discovery is pretty
interesting for when I dive into that.

[1]: [https://github.com/timtadh/data-structures/blob/master/trie/tst.go](https://github.com/timtadh/data-structures/blob/master/trie/tst.go)

------
sagichmal
Good article. It closely reflects the experience we've had at SoundCloud in
our more heavily-stressed services.

~~~
camus2
do you still use ruby and how?

~~~
sagichmal
Sure, the monolithic app at the center is still Rails. We're working to
deprecate it and go all SOA/microservice, but many of those services are
JRuby.

tl;dr still very polyglot here.

------
awda
> All of our existing indexes were based on string data which had associated
> integer IDs in our backend

You already have a perfect hash function :).

~~~
greiskul
The IDs might be sparse though.

~~~
sltkr
Sparse is fine if they are more-or-less randomly distributed.

~~~
awda
I was mostly aiming for unique -> no hash collisions.

------
logicchains
This is probably a stupid question, but I wonder if the author could have used
slices instead of maps with integer keys. It would have used more memory, but
it would probably also have been significantly faster. A significant
proportion of the performance issues I see raised on the Go mailing list seem
to involve maps.

~~~
cronos
Probably not. The author says he uses IDs as keys, which I assume are 32 or 64
bits long. Even at 32 bits, you would need a slice with ~4 billion entries to
hold any potential key. Or you could make a dynamic array (well, a slice),
which would add even more overhead than a map and would end up closely
resembling what a map already does internally. So this would only be a viable
alternative if the keys were really small.

~~~
awda
If 'select min(keycolumn) from footable' is relatively stable and keys are
created ascending (databases do this, right?), a resizing slice (with 0-min()
elided from the bottom) seems like a good alternative to a hashing structure.

If the key-space is relatively sparse, this will waste more memory than the
hash, of course.

------
wolf550e
Do they have to use zlib? For on the wire compression LZ4 is better, for
archiving lzma2 or zpaq is better.

------
ihsw
> Using a map[int]* Metric instead of a map[string]struct{} would give us that
> integer key we knew would be faster while keeping access to the strings we
> needed for the indexes. Indeed, it was much faster: the overall throughput
> doubled.

I'm a little sceptical of this -- type assertions are fast, but it's an extra
step on top of initializing a struct. It would have been nice to see tests
comparing map[string]struct{} to map[int]struct{} and comparing map[string]*
Metric to map[int]* Metric.

Also, there is no way to escape an asterisk, so I apologize for the awkward
space after each one.

~~~
vetinari
My understanding of this section was that in * Metric they keep the original
string ID alongside the struct{}. If they went straight to map[int]struct{},
they would lose the string ID.

