
Google and Microsoft Cheat on Slow-Start. Should You? - bstrong
http://blog.benstrong.com/2010/11/google-and-microsoft-cheat-on-slow.html
======
sh1mmer
This isn't much of a secret. As the article says, Google is lobbying to change
the initial window size in the RFC. A lot of people here at Yahoo! want to see
that too, and personally I think we should be more aggressive with our initial
window, RFC be damned.

This topic was covered really well by Amazon's John Rauser at Velocity Conf:
[http://velocityconf.com/velocity2010/public/schedule/detail/...](http://velocityconf.com/velocity2010/public/schedule/detail/11792)

To address the points in the conclusion:

1\. Fast is good. Fast is also profit.

2\. The net-neutrality argument here is totally bogus. Anyone who knows how
can up their slow-start window today if they choose to. This doesn't really
have anything to do with traffic shaping.

3\. Google have been using their usual data driven approach to support their
proposal for IETF. We need a lot more of that. It's great. The only way we can
really find out how the Internet in general will react to changes like this is
to test them in some real world environment.

4\. I agree, slow-start is a good algorithm with a very valid purpose. The
real problem here is that the magic numbers powering it aren't being kept
inline with changes to connectivity technology and increases in
consumer/commercial bandwidth.
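Those magic numbers are easy to put into perspective. Here is a back-of-envelope sketch (my own figures, not from the article) of how many round trips pure slow start needs to deliver a response, assuming the window simply doubles every RTT and nothing is lost:

```python
def round_trips(segments, initial_window):
    """Round trips to deliver `segments` segments when cwnd doubles each RTT."""
    rtts, cwnd, sent = 0, initial_window, 0
    while sent < segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # slow start: window doubles per RTT
        rtts += 1
    return rtts

# A ~60 KB page is about 42 segments at a 1460-byte MSS.
print(round_trips(42, 3))   # RFC 3390-style initial window -> 4
print(round_trips(42, 10))  # Google-style larger initial window -> 3
```

On a 250ms RTT that one saved round trip is 250ms off the page load, from the initial window alone.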

------
ig1
There are all sorts of latency problems caused by the congestion window size
(and how it gets reset). Because of how the algorithm works, unless you're
sending a continuous stream of data (which allows the congestion window to
grow), the window gets reset to its initial size, which can mean waiting for
an ack round-trip before you get the whole message.

While it's not that big a deal if your users are local to you, if they're on a
different continent each extra roundtrip can easily add 100ms.

I used to do TCP/IP tuning for low latency trading applications (sometimes you
need to use a third party data protocol so can't just use UDP), this sort of
stuff used to bite us all the time.

If latency is important, it is worth sitting down with tcpdump and seeing how
your website loads (i.e. how many packets, how many acks, etc.), as often
there are ways of tweaking connection settings (either via socket options or
kernel settings) that can result in higher performance.

(Try using tcp_slow_start_after_idle if you're using a recent linux kernel;
this won't give you a bigger initial window, but it means once your window
size has grown it won't get reset straight away if you have a gap between data
sends)
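On the socket-option side of that tuning, settings can be applied per connection. A minimal sketch, using TCP_NODELAY (disabling Nagle's algorithm) as the example knob; it's one common latency tweak, though not itself a slow-start setting:

```python
import socket

# Per-connection latency tuning sketch: TCP_NODELAY disables Nagle's
# algorithm, so small writes go out immediately instead of being
# coalesced while waiting for an outstanding ACK.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```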

------
Pahalial
This is interesting, but the article and I differ greatly at this point:
"Being non-standards-compliant in a way that privileges their flows relative
to others seems more than a little hypocritical from a company that's making
such a fuss about network neutrality."

No, no it's not. This has nothing to do with network neutrality; it's a purely
server-side change/fix. Not only that, they're benefiting users without
requiring anyone else to change while they wait for standards bodies to catch
up. This is a similar scenario to HTML5 video, and distinctly more clear-cut
than e.g. '802.11n draft' wireless routers in my opinion.

~~~
shadowmatter
"Benefiting _their_ users," yes. But they're not benefiting whoever else is
sharing the smallest-capacity link -- what he's arguing is that Google is
crowding them out.

Net neutrality is not the only way to privilege your flows on the Internet:
There's nothing to stop me from writing a crude application-layer protocol
atop UDP that implements reliability but not congestion control. (You may have
had to implement something like this for a networking class in school;
otherwise you could start with a protocol like DCCP.) If I were to use that to
send data as fast as I could to some remote computer, I could be sending more
data than the smallest-capacity link could handle. Other TCP/IP connections
sharing that link would detect data loss and thus reduce the amount of data
they put in transit, but my protocol wouldn't have to. I can monopolize that
link.

So assuming your physical layer is tin cans and string, what he's arguing is
that if you have a link with a capacity of 12 segments, then data from Google
will use 10 of them and a client will never expand its outstanding data beyond
2 segments. If both used vanilla TCP/IP, they should share the link evenly.

Of course, speed is a critical factor for Google. Android by default uses TCP
Westwood+.

It's been six years since I tinkered with TCP/IP and really focused on
networking, so someone please correct me if I'm wrong >_<

~~~
kragen
> So assuming your physical layer is tin cans and string, what he's arguing is
> that if you have a link with a capacity of 12 segments, then data from
> Google will use 10 of them and a client will never expand its outstanding
> data beyond 2 segments. If both used vanilla TCP/IP, they should share the
> link evenly.

He's not showing any evidence that Google wouldn't back off to a smaller
window in the face of packet loss; he's just saying their _initial_ window is
9 segments. Once one of those 10 segments in your tin-can router falls out of
its receive buffer, Google will be down to 9 and the other guys get up to 3,
and packet loss will random-walk toward fairness.

Right? I've _never_ tinkered with this stuff, so please correct me if I'm
wrong.

~~~
shadowmatter
I thought that the minimum value cwnd could assume is IW, but looking at RFC
2581 that isn't true: "Upon a timeout cwnd MUST be set to no more than the
loss window, LW, which equals 1 full-sized segment (regardless of the value of
IW)." They even explicitly call out what I erroneously believed, so you are
right -- I apologize!

If you really have a tin cans and string physical layer, Google's larger IW
could be more disruptive to other connections on the link: A vanilla TCP
connection would ramp up cwnd from 1 segment until congestion is observed (in
the slow-start phase) and then grow cwnd conservatively (after ssthresh is
first set). If congestion would be observed at a cwnd value less than 10
segments, then starting with IW at 10 segments could be very disruptive to
others sharing the connection.

Mind you, this argument feels very... academic. As you pointed out, Google's
connection would converge toward fairness anyway. (Unless so little data is
actually transmitted that the connection isn't open long enough for that to
happen.) And most shared links don't saturate at 12 segments. I'd guess that
a high-capacity link would only be at risk if it has a lot of connections (so
that every connection has a cwnd not much greater than 10) and there are
always many (albeit short-lived) connections to Google being created (which
could appear as fewer long-lived connections with a constant cwnd of 10, more
than would be fair).
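A toy model of the dynamics in question (my own sketch, using the RFC 2581 timeout behaviour quoted above: cwnd drops to one segment and ssthresh to half the window):

```python
def step(cwnd, ssthresh, loss=False):
    """One RTT of a toy TCP sender: slow start below ssthresh,
    linear growth above it, RFC 2581-style reset on a timeout."""
    if loss:
        return 1, max(cwnd // 2, 2)  # timeout: cwnd back to 1 segment
    if cwnd < ssthresh:
        return cwnd * 2, ssthresh    # slow start: doubling per RTT
    return cwnd + 1, ssthresh        # congestion avoidance: +1 per RTT

cwnd, ssthresh = 10, 64              # large initial window, a la Google
cwnd, ssthresh = step(cwnd, ssthresh, loss=True)
print(cwnd, ssthresh)                # 1 5: the head start is gone
```

After a single loss the larger starter is on the same footing as everyone else, which is the converge-toward-fairness point.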

------
ajb
Google is proposing this should be allowed as a modification to rfc-3390.
Their draft is <http://tools.ietf.org/html/draft-hkchu-tcpm-initcwnd-01>.
Active discussion of the issue may be found at [http://www.ietf.org/mail-
archive/web/tcpm/current/maillist.h...](http://www.ietf.org/mail-
archive/web/tcpm/current/maillist.html)

~~~
benblack
Minor correction: the current draft is <http://tools.ietf.org/html/draft-ietf-tcpm-initcwnd-00>

------
arturadib
Really interesting research, but man, if you really, _really_ have to worry
about premature optimization for your web app, I'd start with the usual
bottlenecks first - i.e. anything that involves disk IO and/or processor work,
such as databases and mathematical calculations.

Unless you are serving static content only (in which case you are hardly
creating an "app"), the milliseconds you might save with TCP-level
optimizations are _peanuts_ in comparison to the multiple seconds your
database and computations will be requiring.

~~~
kragen
This is exactly backwards. My network latency to North America is >200ms
(RTT). Three round-trip times is about 750ms. You can do 75 disk accesses and
three _billion_ mathematical calculations in that time.

If your database and computations are requiring _multiple seconds_ on a normal
web page, you have _serious_ user experience problems. When you're under
140ms, it feels like the response is happening at the same time as the request
(Dabrowski and Munson weren't able to reproduce the old 50- or 100-millisecond
rule of thumb in what sounds to me like a poorly-controlled experiment;
[http://books.google.com/books?id=aU0MR-MA-
BMC&pg=PA292&#...](http://books.google.com/books?id=aU0MR-MA-
BMC&pg=PA292&lpg=PA292&dq=200+millisecond+user+interface&source=bl&ots=Pxl6LkkTpQ&sig=5hOX-
eKJwi95Ete-
PsdgL7CczbI&hl=en&ei=YSPwTJPONoT48AbOjNHzCw&sa=X&oi=book_result&ct=result&resnum=3&ved=0CB0Q6AEwAg#v=onepage&q=200%20millisecond%20user%20interface&f=false)).
Increasing Google search page render time from 400ms to 900ms dropped traffic
by 20%, according to Marissa Mayer
([http://glinden.blogspot.com/2006/11/marissa-mayer-at-
web-20....](http://glinden.blogspot.com/2006/11/marissa-mayer-at-
web-20.html)). Traditional OLTP systems tried to keep response times under one
second; beyond a second, people start to get frustrated and wonder if
something is broken.

So, for a normal application, the milliseconds you might save by optimizing
your database and computations are _peanuts_ in comparison to the second or
more that TCP-level optimizations could save you.
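Spelling that arithmetic out (assumed figures: 250ms RTT, ~10ms per random disk access, ~4 billion simple operations per second):

```python
rtt_ms = 250
budget_ms = 3 * rtt_ms            # three round trips: 750 ms
disk_seek_ms = 10                 # rough cost of one random disk access
ops_per_ms = 4_000_000            # ~4 GHz of simple operations

print(budget_ms // disk_seek_ms)  # 75 disk accesses fit in the budget
print(budget_ms * ops_per_ms)     # 3,000,000,000 operations fit in it
```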

~~~
_delirium
> If your database and computations are requiring multiple seconds on a normal
> web page, you have serious user experience problems.

This is almost always the case when I think "this website is slow", though.
When HN is slow, it's not because of some added network latency, but because
something is making HN take 3 seconds to serve up my "threads" page, or 2
seconds to successfully post my comment. Same with "reddit is slow", or "this
Wordpress blog is taking forever to load" or "Twitter spins the 'working' icon
for 2 seconds when I click on that 'retweets' thing in the sidebar before
returning anything". Those things are _really_ common in my experience, and
those rather than network round-trip times are by far the biggest and most
annoying slowdowns, at least in my browsing.

~~~
kragen
Yes, of course, because network latency is probably almost the same to almost
all sites to you. So the vast majority of pages take 400ms or whatever to
load, and you don't think anything of it. You only notice when you hit a site
that's out of the ordinary — whether it's Google loading in 85ms by using IW9,
or Twitter taking multiple seconds to query the snail farm or frozen molasses
or whatever it uses for its database now that it's dumped Cassandra.

If IW9 or IW10 gets widely adopted, you'll find yourself thinking "this
website is slow" when you visit the rare backwater that doesn't use it.

------
necro
There was a large discussion earlier about the subject. I posted detailed
comments in that thread so I won't repost but just link.
<http://news.ycombinator.com/item?id=1143317>

------
matthiasl
Can anyone else repeat his experiment?

I tried repeating the experiment. I'm in Sweden, so, annoyingly, a request to
google.com redirects to google.se. If I send my request directly to google.se,
I get a 9k response in 130ms and the initial window looks like 4 to me, i.e. I
can't see anything unexpected happening.

I then tried repeating on Amazon EC2. I can't see anything unexpected there
either, but the RTT from EC2 to google is only about 3ms, which means I can't
assume that the ACKS don't get there.

(The original article author looks at how long the initial 3-way handshake
takes and then assumes that all packets take that long, or, probably, half as
long, i.e. he assumes that ACKS sent up to one RTT before a packet from google
can't have arrived at google in time to affect that packet)

Can anyone else reproduce the experiment?

Other ideas: repeat from Sweden, but send a cookie so that I really get
google.com. Repeat from EC2, but make sure I never send any ACKs after the
three-way handshake. I'm not curious enough to do the latter, it's a fair bit
of work.
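The inference described in the parenthetical can be sketched like this (a simplification, and the timestamps are made up for illustration, using the ~130ms RTT above):

```python
def estimate_initcwnd(request_sent, data_arrivals, rtt):
    """Count data segments that no post-handshake ACK could have triggered.

    An ACK we send after the request needs a full RTT to come back as
    more data, so anything arriving before request_sent + 2*RTT must
    belong to the server's initial burst.
    """
    cutoff = request_sent + 2 * rtt
    return sum(1 for t in data_arrivals if t < cutoff)

rtt = 0.130
request_sent = 0.131       # request goes out right after the handshake
arrivals = [0.262, 0.263, 0.264, 0.265, 0.400, 0.401]
print(estimate_initcwnd(request_sent, arrivals, rtt))   # 4
```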

------
sdizdar
It seems Linux does not have an option to skip slow start and just use the
receiver's advertised window. Does anybody know where in net/ipv4/tcp.c this
would be set?

~~~
wmf
Note that we're not talking about skipping slow start completely; we're
talking about changing slow start parameters.

------
epi0Bauqu
Does anyone know what you would do to easily tune this for FreeBSD?

~~~
SageRaven
My guess is setting the sysctl "net.inet.tcp.slowstart_flightsize" from the
default value of "1" to something else.

~~~
StavrosK
Any idea if that would work on Linux?

~~~
SageRaven
No idea. Type "sysctl -a | grep slow" and see what knobs the kernel offers.

------
jhrobert
I believe the current limits for slow-start are not adapted to the current
Internet anymore.

According to my own observations, the first 30 KB of my pages seem to be
transferred faster than the next 30 KB. It is not until much more is sent that
the average throughput eventually gets back up to what it was during the first
30 KB.

This is definitely weird.

Note: I am using Ubuntu on EC2 hosted VMs.

As a result, as much as I can, I try to keep the size of my content below
30 KB, using multiple concurrent HTTP requests.

I believe this is related to "slow-start" being pessimistic.

Unfortunately, "slow-start" is not configurable on Linux and I don't feel
confident enough to go with some kernel level patch...

Any clue?

~~~
danudey
You can't use custom kernels on Amazon EC2 anyway, so kernel patches aren't
really an option (unless you had some kind of kernel module you can load that
would change the value in memory, which seems dangerous).

~~~
spullara
You can use custom kernels on EC2 now.

[http://ec2-downloads.s3.amazonaws.com/user_specified_kernels...](http://ec2-downloads.s3.amazonaws.com/user_specified_kernels.pdf)

------
vinutheraj
"It is better to ask for forgiveness than permission" - Rear Admiral Grace
Hopper

------
bbuffone
We have been measuring Google's "reachability" performance and it is quite
amazing. The result of their tuning is that they can deliver their initial
HTML in under ~250 milliseconds, and in many locations under 100 ms. The other
thing the data shows is that the standard deviation on the download times is
very small, making the site consistently load fast.

[http://www.yottaa.com/url/4be004065df8ca5a730001fb/reachabil...](http://www.yottaa.com/url/4be004065df8ca5a730001fb/reachability)

------
tlrobinson
_"They actually managed to deliver the whole response in just 70ms, 30ms of
which was spent generating the response"_

Isn't part of that just the network latency? Based on the timestamps for the
SYN and SYN-ACK it looks like a RTT of about 16ms.

EDIT: Nevermind.

Request was sent by the client at 00.017437

Request ACK was received by the client at 00.037139

RTT of about 20ms, so the request was received by the server around 00.027

First packet of the response was received by the client at 00.067151

67-27=40. Assuming a one-way latency of 10ms, it took 30ms to generate the
response.
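That arithmetic as a quick script (timestamps in seconds from the capture above; one-way latency taken as half the measured RTT):

```python
request_sent = 0.017437    # client sends the request
request_acked = 0.037139   # client sees the ACK for it
first_response = 0.067151  # first packet of the response arrives

rtt = request_acked - request_sent        # ~20 ms
one_way = rtt / 2                         # ~10 ms
request_arrived = request_sent + one_way  # server got the request ~0.027
server_time = first_response - request_arrived - one_way
print(round(server_time * 1000))          # ~30 ms generating the response
```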

------
fleitz
One should also note that when IE is talking to IIS, the request will be sent
in the first packet and the initial response will be sent in the first ACK.
You can actually complete a request and response (if small enough) in 3
packets. Also, when tearing down the connection, it's left half-open.

[http://osdir.com/ml/mozilla.devel.netlib/2003-01/msg00018.ht...](http://osdir.com/ml/mozilla.devel.netlib/2003-01/msg00018.html)

------
samueladam
Mike Belshe - An Argument For Changing TCP Slow Start (Jan 11, 2010):

[http://sites.google.com/a/chromium.org/dev/spdy/An_Argument_...](http://sites.google.com/a/chromium.org/dev/spdy/An_Argument_For_Changing_TCP_Slow_Start.pdf)

------
bengtan
On Ubuntu 8.04 (at least), you can set this per route via something like:

ip route change default via x.x.x.x dev eth0 initcwnd 6

but please test thoroughly if trying this.

------
bemmu
Do app engine apps also serve like this?

~~~
jws
I don't think so. I'm pulling a 27k URL over a 100ms-latency link and I'm
seeing roughly 2, 4, 8, 8... for the send bursts.

------
ergo98
Very interesting. Is such a thing configurable in Apache or nginx? It seems to
be a rather rude behavior, but I'm curious how accessible it is.

~~~
8plot
something like: ip route change default via 0.0.0.0 dev eth0 initcwnd 10

~~~
kragen
That doesn't seem to work for me:

    RTNETLINK answers: No such file or directory

Apparently I need to patch my kernel?

~~~
zokier
Slightly off-topic: that's one of my favorite error messages that is actually
relatively common.

------
iepaul
very interesting post.

------
phillijw
Interesting. But it really annoys me when people use "begs the question"
incorrectly. Look it up!

------
d0m
Interesting, but there are so many more important things to consider before
worrying about load time (e.g. zero users experiencing a 30 ms load time is
far worse).

~~~
patio11
Half agree: this level of optimization is less useful when you're not starting
from Google's performance baseline. That said, I can't agree with the point as
generally applied to load times: optimizing them made a difference even at BCC
scale back in 2008-ish. Implementing half of the YSlow recommendations takes
under an hour in modern web frameworks.

