
Scala vs. Go TCP Benchmark - rck
http://eng.42go.com/scala-vs-go-tcp-benchmark/
======
dlsspy
Did you consider running the go client against the scala server and vice
versa?

Also, that's kind of a lot of code. Here's my rewrite of the server:
[http://play.golang.org/p/hKztKKQf7v](http://play.golang.org/p/hKztKKQf7v)

It doesn't return the exact same result, but since you're not verifying the
results, it is effectively the same (4 bytes in, 4 bytes back out). I did
slightly better with a hand-crafted one.

A little cleanup on the client here:
[http://play.golang.org/p/vRNMzBFOs5](http://play.golang.org/p/vRNMzBFOs5)

I'm guessing scala's hiding some magic, though.

I made a small change to the way the client is working, buffering reads and
writes independently (can be observed later) and I get similar numbers
(dropped my local runs from ~12 to .038). This is that version:
[http://play.golang.org/p/8fR6-y6EBy](http://play.golang.org/p/8fR6-y6EBy)

Now, I don't know scala, but based on the constraints of the program, these
actually all do the same thing. They time how long it takes to write 4 bytes *
N and read 4 bytes * N. (my version adds error checking). The go version is
reporting a bit more latency going in and out of the stack for individual
syscalls.

I suspect the scala version isn't even making those, as it likely doesn't need
to observe the answers.

You just get more options in a lower level language.

~~~
jongraehl
I think you're on the right track in supposing that there can't be a huge
performance difference in such a simple task, given that both languages are
compiled and reasonably low-level. The most plausible explanation would amount
essentially to a misconfigured library, not a fundamental advantage due to
say, advanced JVM JIT. Your suggestion to try server-{a,b} x client-{a,b} is
also a good one.

Your modified Go server doesn't return "Pong" for "Ping". It returns "Ping".
And the "small change" version is nonsense; it's fundamentally different:
you're firing off all your requests before waiting for any replies, and so
hiding the latency in the more common RPC-style request-response chain, which
is a real problem.

You speculate a lot ("hiding some magic" "likely doesn't need to observe the
answers") when you haven't offered any insight.

EDIT: Nagle doesn't matter here - it doesn't delay any writes once you read
(waiting for the server's response). It only affects 2+ consecutive small
writes (here I'm trusting
[http://en.wikipedia.org/wiki/Nagle's_algorithm](http://en.wikipedia.org/wiki/Nagle's_algorithm)
\- my own recollection was fuzzy). If Go sleeps client threads between the
ping and the read-response call then I suppose it would matter (but only a
little? And other comments say that Go defaults to Nagle off anyway).

~~~
bsdetector
> The most plausible explanation would amount essentially to a misconfigured
> library, not a fundamental advantage due to say, advanced JVM JIT.

Really, the most plausible explanation? I'd say the most plausible explanation
is that M:N scheduling has always been bad at latency and fair scheduling.
That's why everybody else abandoned it when that matters. It's basically only
good for when fair and efficient scheduling doesn't matter, like maths for
instance, which is why it's still used in Haskell and Rust. I wouldn't be
surprised to see Rust at least abandon M:N soon though once they start really
optimizing performance.

~~~
coolj
Interestingly, both the go client and the scala client perform the same speed
when talking to the scala server (~3.3s total), but the scala client performs
much faster when talking to the go server (~1.9s total), whereas the go client
performs much worse (~23s total, ~15s with GC disabled).

I thought the difference might partly be in socket buffering on the client, so
I printed the size of the send and receive buffers on the socket in the scala
client, and set them the same on the socket in the go client. This didn't
actually bring the time down. Huh.

My next thought was that scala is somehow being more parallel when it
evaluates the futures in Await.result. Running `tcpdump -i lo tcp port 1201`
seems to confirm this. The scala client has a lot more parallelism (judging by
packet sequence ids). Is that really because go's internal scheduling of
goroutines is causing lock contention or lots of context switching?

And...googling a bit, it looks like that is the case:
[https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sL...](https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw/edit)

> Current goroutine scheduler limits scalability of concurrent programs
> written in Go, in particular, high-throughput servers and parallel
> computational programs. Vtocc server maxes out at 70% CPU on 8-core box,
> while profile shows 14% is spent in runtime.futex(). In general, the
> scheduler may inhibit users from using idiomatic fine-grained concurrency
> where performance is critical.

~~~
jongraehl
Interesting, but now I'm even more confused. How can we possibly explain that
a (go client -> go server) (which are in separate go processes) performs far
worse than (go -> scala server), given that the go server seems to be better
when using the scala client?

The comments on the article page have a different report which doesn't suffer
from this implausibility:

    
    
      go server + go client        22.02125152
      scala server + scala client  3.469
      go server + scala client     3.562
      scala server + go client     4.766823392
    

~~~
coolj
> Interesting, but now I'm even more confused. How can we possibly explain
> that a (go client -> go server) (which are in separate go processes)
> performs far worse than (go -> scala server), given that the go server seems
> to be better when using the scala client?

I've been curious about that as well. The major slowdown seems to be related
to a specific combination of go server and client. I don't have a good
explanation. I'd love to hear from someone familiar with go internals.

> go server + go client 22.02125152
>
> ...
>
> scala server + go client 4.766823392

That's roughly equivalent to my numbers.

------
est
> The experiments where performed on a 2.7Ghz quad core MacBook Pro with both
> client and server running locally, so as to better measure pure processing
> overhead. The client would make 100 concurrent connections and send a total
> of 1 million pings to the server, evenly distributed over the connections.
> We measured the average round trip time.

Another let's rape localhost:8080 on a MacBook Pro™ benchmark

~~~
jdub
Please don't use the R word, which is so deeply emotive and meaningful, in
such an inappropriate context.

~~~
fleitz
Rape has meanings other than non-consensual sex.

GP: Please use the word rape to the full extent provided by the English
language, and let those who don't like the word deal with it.

As for the benchmark, who writes their own load balancer? Isn't this generally
a solved problem? If the point is to extract max performance from a simple
ping-pong server then I'd go right to C and epoll/libevent/etc. I'm guessing
that the team is somehow trying to extrapolate data from a ping-pong server to
the actual problem they're trying to solve, which is, dare I say, stupid.

In general the best way to solve this problem is to use whatever language the
team / person writing the software likes, because having them write it faster
generally outweighs whatever server costs one will run into. Whenever this is
not the case, the best answer is, 99% of the time: write it in C.

~~~
aninhumer
It doesn't matter whether there are other meanings. The fact remains that
there is a single meaning which is understandably deeply upsetting to some
people. I'd imagine you'd avoid topics that upset your friends in real life,
and some of us like to extend this courtesy to strangers on the internet as
well. If you feel your "full extent of the English language" is more important
than the emotions of strangers, that's your choice, but forgive me and others
for judging you as an asshole.

~~~
coldtea
> _It doesn't matter whether there are other meanings. The fact remains that
> there is a single meaning which is understandably deeply upsetting to some
> people._

People getting upset by mere words (not even uttered against them) are hardly
worth a hacker's time.

Also notice how you, the oh-so-sensitive to the "emotions of strangers" and
the "meanings that upset people", called him "an asshole" (for merely
suggesting the use of a word). Way to go for tolerance.

~~~
vectorpush
_People getting upset by mere words (not even uttered against them) are hardly
worth a hacker's time._

Get over yourself. Taking on the title of hacker isn't some prestigious
achievement, it's a self-aggrandizing social signal for tech hipsters. Beyond
that, every caliber of individual can be upset by 'mere words', including
yourself.

~~~
coldtea
> _Taking on the title of hacker isn't some prestigious achievement, it's a
> self-aggrandizing social signal for tech hipsters._

This forum is called "Hacker News" for a reason. And that reason predates
"tech hipsters" by 40+ years. It's not an achievement, I'll give you that. But
it IS a culture, and that culture doesn't take self-censorship and puritan
values very well...

> _Beyond that, every caliber of individual can be upset by 'mere words',
> including yourself._

Being upset when some words are targeted at you or at people you do not think
deserve such treatment is normal. It's being upset just because of the use of
words that's prudish and bad.

~~~
vectorpush
_"that reason predates "tech hipsters" by 40+ years"_

_"it IS a culture, and that culture doesn't take self-censorship and puritan
values very well..."_

These days, the title of 'hacker' is akin to the title 'patriot': everybody
knows what a _real_ one looks like and they're all too happy to monkey patch
their own arbitrary components into the definition. Last I heard, there is no
general stance regarding self-censorship in the hacker community.

Also, I think it's an impressive type conversion for you to cast what is
commonly described as overzealous liberalism to puritanical religiosity. There
is nothing puritanical about respecting the sensitivities of sexual assault
victims.

~~~
coldtea
> _Also, I think it's an impressive type conversion for you to cast what is
> commonly described as overzealous liberalism to puritanical religiosity._

Well, I don't consider it that impressive.

Political correctness is just one method the liberals found to maintain the
puritanical religiosity of their past. Just the secular side of the same
coin.

You cannot get puritanism out that easily, you just divert it from religious
thinking to other endeavours.

We have the same kind of conversions in Europe too -- not to mention that it's
a well-discussed topic in literature and psychology.

------
bad_user
So even if Scala is ahead in this (flawed) benchmark, that's not how you write
a TCP server in Scala, because you want to do it non-blocking. Not doing it
based on asynchronous I/O means that in a real-world scenario the server will
choke under the weight of slow connections, not to mention be susceptible to
really cheap DoS attacks like Slowloris [1].

Seriously, it goes beyond the underlying I/O API that you're using. If
anywhere in the code you're reading from an InputStream or you're writing to
an OutputStream that's connected to an open socket, then that's a blocking
call that can crush your server. Right now, every Java Servlets container
that's not compatible with the latest Servlets API 3.1 can be brought down
with Slowloris, even if under the hood they are using NIO.

Option A for writing a server in Scala is Netty [2].

Option B for writing a server in Scala is the new I/O layer in Akka [3].

[1] [http://ha.ckers.org/slowloris/](http://ha.ckers.org/slowloris/)

[2] [http://netty.io/](http://netty.io/)

[3]
[http://doc.akka.io/docs/akka/snapshot/scala/io.html](http://doc.akka.io/docs/akka/snapshot/scala/io.html)

~~~
caniszczyk
Also see Finagle (uses Netty):
[http://twitter.github.io/finagle/](http://twitter.github.io/finagle/)

------
voidlogic
>The experiments where performed on a 2.7Ghz quad core MacBook Pro with both
client and server running locally

No no no. Assuming your production code runs on Linux, THAT is where you need
to do this test. It is extremely naive to assume that either the JVM or the Go
runtime will perform system-interfacing tasks even remotely similarly between
OS X and Linux. Linux is what you will use in production. Linux is almost
always faster (more effort from devs both on the kernel TCP/IP side and the
runtime/userspace side).

Write your Scala, Java, Go wherever you want, but please, benchmark it in a
clone of your production environment!

P.S. In production I assume your client and server will not be local... don't
do this, kernels do awesome/dirty optimizations over loop-back interfaces,
sometimes even bypassing large parts of the TCP/IP stack, parts you want
included in any meaningful benchmark.

------
jaekwon
Results from my Mac:

    
    
      Go server vs Go client: 10ms
      Scala server vs Scala client: 3ms
      Go server vs Scala client: 4ms
      Scala server vs Go client: ????
    

Scala server against the Go client is really slow (?). I reduced the ping
count by a factor of 100, and extrapolating I think it would have reported
around 670ms. What gives?

I don't know much about Scala Futures, but isn't Scala's client doing
something completely different than the Go client? Scala's client with
Future.sequence looks like it's calling each `ping` method sequentially.

Printing the connection identifier {0,100} on open & close shows that while it
isn't completely sequential, only about a handful of connections are open at a
time.

On the other hand, the Go client appears to switch amongst goroutines more
frequently. All the connections open before any connection closes.

In other words, I think the difference in performance is due to the
difference in how randomly the connections are shuffled. The terrible
performance in the last case, I think, shows a bottleneck in the Scala
server rather than the Go client.
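(For context, the Go client pattern being described is roughly this shape; the connection and ping counts are shrunk from the benchmark's 100 and 10,000 so it runs instantly, and the in-process echo server is a stand-in for the real one:)

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
)

func main() {
	ln, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		panic(err)
	}
	// Tiny echo server standing in for the benchmark server.
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				defer c.Close()
				io.Copy(c, c)
			}(c)
		}
	}()

	const conns, pings = 5, 10 // 100 and 10,000 in the real benchmark
	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		wg.Add(1)
		// All connections are opened up front; the scheduler then
		// interleaves the goroutines as it sees fit.
		go func() {
			defer wg.Done()
			c, err := net.Dial("tcp", ln.Addr().String())
			if err != nil {
				panic(err)
			}
			defer c.Close()
			buf := make([]byte, 4)
			for j := 0; j < pings; j++ {
				if _, err := c.Write([]byte("Ping")); err != nil {
					panic(err)
				}
				if _, err := io.ReadFull(c, buf); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```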

------
damian2000
Isn't this more like a comparison of the JVM's TCP library vs Go's TCP
library... not so much Scala vs Go?

~~~
est
They are all wrappers around the OS TCP/IP stack.

------
aaron42net
If the eventual production app would run on Linux (which I'm only guessing
based on the context), this benchmark should probably be run there. Darwin's
surprisingly higher system call and context switch overhead can be deceptive
for apps that are OS-bound.

------
smegel
A quick look at the language benchmarks game, and Go is not 10x slower than
Java for most tests. Java is often 3-5x slower than C/C++/etc. (although maybe
it didn't get enough time to warm up).

~~~
trailfox
A quick look at the language benchmark game, and Go varies between 7x slower
than Java and approx. the same speed as Java (varies by test):
[http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?t...](http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?test=all&lang=go&lang2=java)

I fail to see how Go being slower than Java relates to C being faster than
Java.

~~~
smegel
Well given both C and Go are native compiled languages, I would think C is a
realistic (if distant) goal for Go performance, at least for algorithmic stuff
where you're not bouncing in and out of the runtime. I was commenting on how
far it has to go.

~~~
trailfox
Ok, I see. Technically a JIT compiles to native code too, so for a long
running app there shouldn't be much difference. Both Go and Java are garbage
collected, but the JVM has the more sophisticated GC.

------
geal
Ok, so. When you need to write a load balancer and want to test different
languages for the task, you don't do a benchmark like that one.

Writing a "ping-pong"? And not using the same client to test both servers?

It would not have been too hard to write a simple proxy in both languages. Not
even worrying about parsing HTTP headers, just testing TCP, that is really
easy.

Now, if you really want to test the performance, you have to implement it
differently. Just two small features you would need in both:

* non-blocking IO: right now, you're starting a new future in Scala; that's easy to write but not really efficient (it might work better with goroutines)

* zero copy: if you're load balancing, you will spend your time moving bits, so you'd better make sure that you don't copy them too much. It is possible with Scala, but it looks like Go does not support it

Now, when you have reasonable testing grounds (that wouldn't be more than a
hundred lines in either language), you'd better get your statistics right.

"The client would make 100 concurrent connections and send a total of 1
million pings to the server, evenly distributed over the connections. We
measured the average round trip time" -> that is NOT how you should test.
Here, you would want to know what is the nominal pressure the balancer could
handle, so you must measure a lot of metrics:

* RTT

* bandwidth (per user and total)

* time to warm up (the JVM optimizes a lot of things on the fly; you have to wait for it)

* operational limits (what is the maximal bandwidth at which the performance crashes? same for number of users)

And then, you don't measure only the average values. You must measure the
standard deviation. Because you could have a good average, but wildly varying
data, and that is not good.

Last thing: the macbook may not be a good testing system.

------
Oculus
What would be the advantages of using a custom built load balancer vs.
something off the shelf like Nginx?

------
willvarfar
The benefit of go is the cheap threads and CSP which make it scale well for
complex servers.

I think we'll see the Go runtime gaining Single System Image distribution,
performance improvements, and libraries rather than services (e.g.
groupcache), making it a very different world to develop in than Scala.

~~~
octo_t
Scala has Actors, which are as pure CSP as you can get (Go's CSP has shared
mutable state, Actors do not).

------
dschiptsov
_... that the Go server had a memory footprint of only about 10 MB vs. Scala’s
nearly 200 MB._ Priceless!

200 MB for such a crappy server with only a hundred connections.

So, it is not about whose wrapper around epoll is thinner, but about how the
data are represented and copied.

~~~
ratherbefuddled
You can't really come to a conclusion since no information is given about the
configuration of the JVM. Most likely the minimum heap size was unset which
often means it will default to 1/64 of the total system memory.

~~~
dschiptsov
I've had my conclusions for quite a long time. :)

------
shin_lao
I think the JVM TCP stack is extremely mature, but I'm really surprised to see
that Go is ten times slower.

It would have been interesting to have a C/C++ benchmark for reference.

~~~
wyuenho
With Go using 20 times less memory, I'd take Go any day since I can just throw
in 20 more processes and up my throughput 5-10x. But of course, scaling like
this takes a bit of design and the processes have to share nothing.

~~~
shin_lao
I would take the memory measurement with a grain of salt. A 200 MiB large JVM
doesn't mean all the 200 MiB are used. They could have been reserved
(preemptive memory allocation).

~~~
levosmetalo
I agree. I've seen so many times clueless people complain about JVM memory
footprint for tiny "benchmarks". In practice, it's never an issue for long
running web applications. That, and people who don't understand how JIT works
and why it has a start-up penalty.

~~~
chetanahuja
_In practice, it's never an issue for long running web applications._

You might want to clarify that statement a bit. It almost sounds like you are
implying that memory pressure is never an issue for long running web
applications in java. Did you mean to say something else?

~~~
levosmetalo
The initial amount of memory required by the VM, while significant for small
command-line utilities, is only a fraction of the total memory required by an
application. In a web application tuned for performance, a lot of the memory
will be used for caching anyway. Also, I'd be glad to pay a small memory hit
upfront if that means I get a top quality GC and a very low probability of
memory leaks in the long run.

As for the startup time, MojoJolo summed it up.

~~~
chetanahuja
I note that the memory pressure question got swept under the rug there :-)

That's ok. It's not a revelation to anybody here (I hope) that you pay an
enormous cost in memory overhead for acceptable performance from the JVM. That
"top quality GC" basically requires a 2X overhead (on top of your actual cache
payload) to deliver reliably low latency and high throughput.

~~~
fauigerzigerk
I agree completely. And in spite of that "top quality" GC and all the tuning
in the world you're still running the risk of having the world stop on you for
tens of seconds on larger systems.

The JVM (at least OpenJDK, probably not Azul) is quickly becoming untenable as
server memory goes into the hundreds of GBs. I'm reluctantly moving back to
C++ for that reason alone.

~~~
virmundi
How do you get around heap fragmentation? I know that the JVM (Oracle I
believe) is really limited to about 32 GB of RAM before it has real issues.
But the nice thing is that the GC will compact the heap for better future
performance.

As a possible workaround to the JVM limit, a distributed architecture with N
JVMs each running a portion of the task could solve the small memory space
with minimal system overhead. What I mean is: let's say you need 64 GB of
memory for your app. Given the comment above, Java would not do well with
this. But you could have four 16 GB JVMs, each handling 1/4 of the work. The
GC would prevent the fragmentation you'd see in long running C++ apps and
still provide you with operational capacity.

~~~
fauigerzigerk
Heap fragmentation hasn't been a big problem for me. Using multiple JVMs means
reimplementing all data structures in shared memory and creating my own memory
allocator or garbage collector for that memory. It's a huge effort.

Many applications can distribute work among multiple processes because they
don't need access to shared data or can use a database for that purpose. But
for what I'm doing (in-memory analytics) that's not an option.

~~~
virmundi
You've probably since moved on from this conversation, but I wonder if Tuple
Space might help [1]. It provides a distributed memory feel to applications.
Apache River provides one such implementation [2].

Another question about in-memory analytics is: do you have to be in-memory?
I'm currently working on an analytics project using Hadoop. With the help of
Cascading [3] we're able to abstract the MR paradigm a lot. As a result we're
doing analytics across 50 TB of data every day, once you count workspace data
duplication.

1 -
[https://en.wikipedia.org/wiki/Tuple_space](https://en.wikipedia.org/wiki/Tuple_space)

2 - [http://river.apache.org/index.html](http://river.apache.org/index.html)

3 - [http://cascading.org](http://cascading.org)

~~~
fauigerzigerk
Thanks for the links. The reason why we decided to go with an in-memory
architecture for this project is that we have (soft) realtime requirements and
complex custom data structures. Users are interactively manipulating a
medium-size (hundreds of gigs) dataset that needs to be up-to-date at all
times.

The obvious alternative would be to go with a traditional relational database,
but my thinking is that the dataset is small enough to do everything in memory
and avoid all serialization/copying to/from a database, cache or message
queue. Tuple Spaces, as I understand it, is basically a hybrid of all those
things.

------
madisp
"The actual test code contained some functionality to deal with connection
errors, omitted here for brevity."

Any way to see the actual test code?

------
msie
Ugh, after reading all the comments here I wonder how a mere-mortal programmer
gets multi-threaded, network programming done right. It's not clear to me if
there is a clear winner between Go and Scala/JVM. Are the majority of programs
out there crappy, memory-hogging and non-performant?

EDIT: Any good references out there? Thanks!

------
luikore
So this Go server is slower than a single-threaded, blocking TCP server in
Ruby. And the memory? Almost the same:

    
    
        require "socket"
        s = TCPServer.new 'localhost', 1201
        loop do
          c = s.accept
          c << 'Pong' if c.read(4)
          c.close
        end
    

~~~
dlsspy
It's not obvious to me what that does. Does that only service one of the 100
clients or one of their 10,000 pings each? Or both?

~~~
luikore
It's one at a time. But the server only performs a one-shot task, without any
"hang on and wait 30 seconds"-style long connections, and the default socket
backlog of TCPServer is > 100, so every client gets served within a delay of
(0~99)*6ms.

In short, every round is fully served, and the concurrency level is 100.

------
anuraj
It is the JVM rather than Scala - one of the fastest VMs implemented to
date.

------
corresation
As a word of advice -- you are almost certainly wasting your time writing a
load balancer. There are close to zero cases where someone can legitimately
justify such an exercise.

In any case, your benchmark is flawed (as virtually all benchmarks are). The
reason Go _the client_ is slower is that it tries to pervasively "thread"
via the M:N scheduler -- every wait check causes it to yield the actual thread
and switch goroutines, creating a large amount of overhead. The Scala case, on
the other hand, is dramatically more limited and will not incur this overhead.

The Go server does not have this fault, and likely performs near the top. And
aren't we talking about a server anyway?

Now as to the client: while we could naively criticize M:N scheduling based
upon this, try giving it a more realistic workload (unless you seriously plan
on load balancing pongs). Instead of ping/pong, return larger amounts of data,
e.g. 32KB, preferably over actual network connections (not localhost).

The Go client will catch up, if not shoot into the lead. M:N scheduling is
optimal for most real-world workloads, though it is less optimal for
spin-off-a-million-goroutines-that-do-nothing type tests.

This is not a test of TCP overhead, or a realistic test, but instead
demonstrates the small overhead of goroutines when you give each a minuscule
amount of wait work.

~~~
diakritikal
You're right about the Go client, I'm also scratching my head as to why it's
doing the extra lookup instead of just using a dial function directly:

    
    
        tcpAddr, _ := net.ResolveTCPAddr("tcp4", "localhost:1201")
        conn, _ := net.DialTCP("tcp", nil, tcpAddr)
    

Vs.

    
    
        val socket = new Socket("localhost", 1201)
    

Edit: Go has e.g. net.Dial("tcp", "localhost:1201"), and as someone pointed
out elsewhere, for a more accurate bench why not use the numerical address in
both clients?
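(i.e. the resolve-then-dial pair collapses to a single net.Dial. A self-contained sketch, with a throwaway listener standing in for the benchmark server that would be at 127.0.0.1:1201:)

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Stand-in listener so the example is self-contained.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		if c, err := ln.Accept(); err == nil {
			c.Close()
		}
	}()

	// One call replaces ResolveTCPAddr + DialTCP; a numeric address
	// skips name resolution entirely.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Close()
	fmt.Println("connected")
}
```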

------
cmccabe
EDIT. OK, so. I ran this benchmark myself on an 8-core Xeon running Linux.
2.13 GHz, CentOS 6.2. Kernel was 2.6.32-220.el6.x86_64. 50 gigs of RAM.

I got somewhere between 2.0 and 2.2 "milliseconds per ping" for Scala 2.9, and
somewhere between 3.5 and 3.7 for Go 1.1. This is not the 10x difference that
the authors reported, but it is something. The difference may be due in part
to the different platform and hardware I am using.

Contrary to what I wrote earlier, I noticed that GOMAXPROCS=8 did seem to be
slower than GOMAXPROCS=4 here. I got around 4 "milliseconds per ping" with
GOMAXPROCS=8. Using a mutex and explicit condition variable shaved off maybe
0.2 milliseconds on average (very rough estimation).

Again contrary to what I wrote earlier, Nagle on versus off didn't seem to
matter in the Go code. I still think you should always have it off for a test
like this, but on my setup I did not see a difference.

I still don't think this benchmark is showing what they think it is. I have a
hunch that this is more of a scheduler benchmark than a TCP benchmark at all.
I think I'd have to haul out vtune to get any further, and I'm getting kind of
tired (after midnight here).

~~~
jongraehl
> First of all, they're testing on MacOS, which is not going to be the
> platform they're actually using for the server code. The backends can be
> very different.

Unless you have evidence that Go has MacOS-only bugs, this is meaningless
speculation (though I agree that it's _possible_ that there's no problem on
Linux and that people don't generally run mac servers, I'm not sure why we
should privilege your hypothesis).

Agreed about GOMAXPROCS=4 - that seemed questionable to me (I don't know what
it does, precisely, but I don't see a 4-anything limit in the Scala code).

~~~
jongraehl
Thanks to parent (after edit) for really testing. It turns out that he was
right that Go on Mac is substantially worse than Linux - my bad. Maybe the Go
lib authors didn't put much effort into reading the subtly different
BSD/Darwin vs Linux syscall semantics.

To explain the Nagle algorithm's irrelevance to this case, we have to
understand how it works. It doesn't delay any writes once you read. It only
affects two consecutive small writes (my memory was fuzzy so I checked
[http://en.wikipedia.org/wiki/Nagle's_algorithm](http://en.wikipedia.org/wiki/Nagle's_algorithm)
). Odds of preemption between the client's write and its read seem small, so
it shouldn't matter whether you Nagle or not.

~~~
cmccabe
Yeah, I always seem to forget the exact details of Nagle (probably because
every project I've worked on just turns it off). Write-write-read is the
killer, I guess: just doing a small write followed by a small read, or vice
versa, should not be affected by Nagle. So the results I got make sense.

Re: MacOS, I do know that some of the Go developers use Macs as their primary
desktops. So I don't think they neglect it, but given that they're targeting
the server space, it makes sense to optimize Linux more.

I still haven't seen any really good explanation of these results. I don't buy
the argument that the JVM is providing the advantage here. The main thing that
the JVM is able to do is dynamically recompile code, and this shouldn't be a
CPU-bound task.

~~~
diakritikal
It's also hard to accurately profile Go programs on OS X because of bugs in
it's now quite stale kernel. Specifically SIGPROF on OS X isn't always sent to
the currently executing thread. Afaik this isn't a problem on newer FreeBSD
kernels.

