Google and Microsoft Cheat on Slow-Start. Should You? (benstrong.com)
437 points by bstrong on Nov 26, 2010 | hide | past | favorite | 67 comments


This isn't much of a secret. As it says in the article, Google are lobbying to change the initial window size in the RFC. A lot of people here at Yahoo! want to see that too, and personally I think we should be more aggressive with our initial window, RFC be damned.

This topic was covered really well by Amazon's John Rauser at Velocity Conf: http://velocityconf.com/velocity2010/public/schedule/detail/...

To address the points in the conclusion:

1. Fast is good. Fast is also profit.

2. The net-neutrality argument here is totally bogus: anyone who knows how can increase their initial window today if they choose to. This doesn't really have anything to do with traffic shaping.

3. Google have been using their usual data-driven approach to support their proposal to the IETF. We need a lot more of that. It's great. The only way we can really find out how the Internet in general will react to changes like this is to test them in some real-world environment.

4. I agree, slow-start is a good algorithm with a very valid purpose. The real problem here is that the magic numbers powering it aren't being kept in line with changes to connectivity technology and increases in consumer/commercial bandwidth.


There are all sorts of latency problems caused by the congestion window size (and how it gets reset). Because of how the algorithm works, unless you're sending a continuous stream of data (which allows the congestion window to grow), the window gets reset to its initial size, which can mean waiting for an ACK round trip before you get the whole message.

While it's not that big a deal if your users are local to you, if they're on a different continent each extra roundtrip can easily add 100ms.
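To put rough numbers on that, here's a back-of-envelope sketch of how many round trips classic slow start needs to deliver a small response (no loss, no delayed ACKs; the page size, MSS, and RTT are illustrative assumptions, not figures from the thread):

```python
def round_trips(total_segments, iw=1):
    """Rough count of round trips needed to deliver `total_segments`,
    assuming the window doubles each RTT (classic slow start, no loss)."""
    sent, cwnd, rtts = 0, iw, 0
    while sent < total_segments:
        sent += cwnd
        cwnd *= 2
        rtts += 1
    return rtts

# A ~40KB page is about 28 segments at a 1460-byte MSS.
# With a 100ms RTT, each round trip is 100ms of pure waiting:
for iw in (1, 4, 10):
    print(f"IW={iw}: {round_trips(28, iw)} round trips")
```

Under those assumptions, IW=10 delivers the same page in two round trips where IW=1 needs five, i.e. 300ms saved at a 100ms RTT.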

I used to do TCP/IP tuning for low-latency trading applications (sometimes you need to use a third-party data protocol, so you can't just use UDP); this sort of stuff used to bite us all the time.

If latency is important, it is worth sitting down with tcpdump and seeing how your website loads (i.e. how many packets, how many ACKs, etc.), as often there are ways of tweaking connection settings (either via socket options or kernel settings) that can result in higher performance.

(Try setting tcp_slow_start_after_idle to 0 if you're using a recent Linux kernel; this won't give you a bigger initial window, but it means that once your window size has grown it won't get reset straight away if you have a gap between data sends.)


This is interesting, but the article and I differ greatly at this point: "Being non-standards-compliant in a way that privileges their flows relative to others seems more than a little hypocritical from a company that's making such a fuss about network neutrality."

No, no it's not. This has nothing to do with network neutrality; it's a purely server-side change/fix. Not only that, they're benefiting users without requiring anyone else to change while they wait for standards bodies to catch up. This is a similar scenario to HTML5 video, and distinctly more clear-cut than e.g. '802.11n draft' wireless routers in my opinion.


"Benefiting _their_ users," yes. But they're not benefiting whoever else is sharing the smallest-capacity link -- what he's arguing is that Google is crowding them out.

Net neutrality is not the only way to privilege your flows on the Internet: There's nothing to stop me from writing a crude application-layer protocol atop UDP that implements reliability but not congestion control. (You maybe had to implement something like this for your networking class in school; otherwise you could start with a protocol like DCCP.) If I were to use that to send data as fast as I could to some remote computer, I could be sending more data than the smallest-capacity link could handle. Other TCP/IP connections sharing that link would detect data loss and thus reduce the amount of data they put in transit, but my protocol wouldn't have to. I can monopolize that link.
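For what it's worth, the sending half of such a crude protocol fits in a few lines; everything here (the function name, the loopback demo) is invented for illustration, not taken from the thread:

```python
import socket

# Toy illustration of "reliability without congestion control": blast
# fixed-size datagrams as fast as the OS allows, tagging each with a
# sequence number the receiver could use for reassembly and retransmit
# requests, but never backing off on loss.
def blast(dest, segments=100, size=1400):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(segments):
        payload = seq.to_bytes(4, "big") + b"x" * (size - 4)
        s.sendto(payload, dest)  # no ACK wait, no cwnd: just send
    s.close()
    return segments

# Demo against a local receiver on an ephemeral port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
sent = blast(receiver.getsockname())
receiver.close()
print("sent", sent, "datagrams without ever slowing down")
```

A TCP flow sharing a bottleneck with this would keep halving its window on every loss while this sender never slows down, which is the monopolization described above.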

So assuming your physical layer is tin cans and string, what he's arguing is that if you have a link with a capacity of 12 segments, then data from Google will use 10 of them and a client will never expand its outstanding data beyond 2 segments. If both used vanilla TCP/IP, they should share the link evenly.

Of course, speed is a critical factor for Google. Android by default uses TCP Westwood+.

It's been six years since I tinkered with TCP/IP and really focused on networking, so someone please correct me if I'm wrong >_<


> So assuming your physical layer is tin cans and string, what he's arguing is that if you have a link with a capacity of 12 segments, then data from Google will use 10 of them and a client will never expand its outstanding data beyond 2 segments. If both used vanilla TCP/IP, they should share the link evenly.

He's not showing any evidence that Google wouldn't back off to a smaller window in the face of packet loss; he's just saying their initial window is 9 segments. Once one of those 10 segments in your tin-can router falls out of its receive buffer, Google will be down to 9 and the other guys get up to 3, and packet loss will random-walk toward fairness.

Right? I've never tinkered with this stuff, so please correct me if I'm wrong.


I thought that the minimum value cwnd could assume is IW, but looking at RFC 2581 that isn't true: "Upon a timeout cwnd MUST be set to no more than the loss window, LW, which equals 1 full-sized segment (regardless of the value of IW)." They even explicitly call out what I erroneously believed, so you are right -- I apologize!

If you really have a tin cans and string physical layer, Google's larger IW could be more disruptive to other connections on the link: A vanilla TCP connection would ramp up cwnd from 1 segment until congestion is observed (in the slow-start phase) and then grow cwnd conservatively (after ssthresh is first set). If congestion would be observed at a cwnd value less than 10 segments, then starting with IW at 10 segments could be very disruptive to others sharing the connection.

Mind you, this argument feels very... academic. As you pointed out, Google's connection would converge toward fairness anyway. (Unless so little data is actually transmitted that the connection isn't open long enough for that to happen.) And most shared links don't saturate at 12 segments. I'd guess that a high-capacity link would only be at risk if it has a lot of connections (so that every connection has a cwnd not much greater than 10) and there are always many (albeit short-lived) connections to Google being created (which together look like long-lived connections with a constant cwnd of 10, more than would be fair).
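As a toy illustration of that convergence, here's a sketch of a cwnd trace under the RFC 2581 rules quoted above (doubling in slow start, +1 per RTT in congestion avoidance, cwnd back to one segment on timeout); the RTT count and timeout schedule are arbitrary:

```python
def simulate(iw, timeouts_at=frozenset(), rtts=20):
    """Toy cwnd trace in segments, one entry per RTT: slow start doubles
    cwnd until ssthresh, then congestion avoidance adds one segment per
    RTT; a timeout sets ssthresh to half the flight size and cwnd back
    to 1 full-sized segment (the loss window LW, regardless of IW)."""
    cwnd, ssthresh, trace = iw, float("inf"), []
    for t in range(rtts):
        trace.append(cwnd)
        if t in timeouts_at:
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1            # back to LW, per RFC 2581
        elif cwnd < ssthresh:
            cwnd *= 2           # slow start
        else:
            cwnd += 1           # congestion avoidance
    return trace

print(simulate(iw=10, timeouts_at={2}))
print(simulate(iw=1, timeouts_at={2}))
```

The point being: one timeout later, the IW=10 flow is at the same one-segment loss window as everyone else.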


While it's true that Google is starting aggressively, they are still using the slow-start algorithm, not ignoring it. If your physical layer is tin cans and string, sure, you'll get crowded out, but then the connection will degrade in the same way it would if they were using the default window size.

Microsoft on the other hand should really use slow-start.

I think it's difficult to argue that the profile of the underlying network hasn't changed since the last time an RFC was standardized on this issue. The problem is revving the magic numbers in the standards periodically to reflect changes in topology.

While you can say that Google should stick to the standard, unlike other net neutrality issues this isn't a change available only to a few large companies. Anyone with control of their stack can make this configuration.

The point of net neutrality is to ensure that changes which are economically feasible for only a small group of companies are not enacted, so that those companies cannot form a de facto monopoly.


That sums up the "trick" I used in '97-'98 over "tin cans and string" Russian links of the time to make sure that my large data would make it. It was really hard on my "neighbors" at the time.


You're spot on. Whether the protocol needs to change with changing times and network speeds is an entirely different question.


I'm not sure how the fact that it's a server-side change makes it ok. I think that everyone would agree that turning off congestion control entirely on the server side would be bad and would negatively impact other flows.

The question, then, is whether this change is significant enough to increase internet congestion (and therefore packet loss for others). This is a subject of heated debate at the moment.


It doesn't make it okay, it just makes it not a "net neutrality" issue. It's more of a "good neighbour" issue.


Network neutrality says "you shouldn't be able to buy better network performance than your competition". What Google is doing is violating standards in order to give themselves better network performance than their competition.

It's not exactly the network neutrality issue, no; but it's related, and where it differs, I'd say that what Google is doing is worse -- at least having companies pay for packet prioritization won't cause internet congestion collapse.


Fair point. In retrospect, invoking net neutrality wasn't really called for.


Are neutrality and hostility mutually exclusive?


Being a bad neighbor doesn't imply hostility.


Also, as a technical point, the amount of total internet utilization caused by the fetching of http web pages is so small that I doubt this practice could significantly harm any other traffic.


> that I doubt this practice could significantly harm any other traffic.

I agree with you, but whenever I see something like this the back of my mind always chimes in with "famous last words".


They're only famous when they actually turn out to be the last words said. Far more often they're said and wind up being proved correct.


I think you're reading too much into this:)

It's just an expression whose meaning has moved away from its literal sense to now mean something like "How bad could it be?" or "I know this is going to bite me in the ass someday."


It has nothing to do with net neutrality, but it does have to do with the stability and reliability of the internet at large. If everyone, for example, tweaked TCP in different, incompatible ways, we'd have contention all over and things just wouldn't work well.


Google is proposing this should be allowed as a modification to rfc-3390. Their draft is http://tools.ietf.org/html/draft-hkchu-tcpm-initcwnd-01. Active discussion of the issue may be found at http://www.ietf.org/mail-archive/web/tcpm/current/maillist.h...


Minor correction: the current draft is http://tools.ietf.org/html/draft-ietf-tcpm-initcwnd-00


Really interesting research, but man, if you really, really have to worry about premature optimization for your web app, I'd start with the usual bottlenecks first - i.e. anything that involves disk IO and/or processor work, such as databases and mathematical calculations.

Unless you are serving static content only (in which case you are hardly creating an "app"), the milliseconds you might save with TCP-level optimizations are peanuts in comparison to the multiple seconds your database and computations will be requiring.


This is exactly backwards. My network latency to North America is >200ms (RTT). Three round-trip times is about 750ms. You can do 75 disk accesses and three billion mathematical calculations in that time.

If your database and computations are requiring multiple seconds on a normal web page, you have serious user experience problems. When you're under 140ms, it feels like the response is happening at the same time as the request (Dabrowski and Munson weren't able to reproduce the old 50- or 100-millisecond rule of thumb in what sounds to me like a poorly-controlled experiment; http://books.google.com/books?id=aU0MR-MA-BMC&pg=PA292&#...). Increasing Google search page render time from 400ms to 900ms dropped traffic by 20%, according to Marissa Mayer (http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20....). Traditional OLTP systems tried to keep response times under one second; beyond a second, people start to get frustrated and wonder if something is broken.

So, for a normal application, the milliseconds you might save by optimizing your database and computations are peanuts in comparison to the second or more that TCP-level optimizations could save you.
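The back-of-envelope comparison above checks out; here's the arithmetic, with the per-seek time and ops-per-second figures being my own rough assumptions for 2010-era hardware:

```python
rtt = 0.250                         # ~250ms round trip, per the comment
three_rtts = 3 * rtt                # 0.75s of pure network waiting

seek = 0.010                        # assumed ~10ms per random disk access
ops_per_second = 4e9                # assumed simple ops/sec on one core

print(three_rtts / seek)            # disk accesses you could do instead
print(three_rtts * ops_per_second)  # simple operations you could do instead
```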


Dabrowski and Munson weren't able to reproduce the old 50- or 100-millisecond rule of thumb in what sounds to me like a poorly-controlled experiment

It sounds rather poorly-controlled to me too.

They mention that they didn't account for the time from mouse-down to mouse-release. Seriously? Here is a program that can measure that difference: http://www.daimi.au.dk/~sandmann/click.py. It uses GTK+ so you'll probably need Linux to run it. For me, the delay seems to be around 30 ms. They also don't mention the framerate of the screen or whether they controlled for that. On a 60Hz monitor there is a delay of 17ms between frames.

Both 17 and 30 ms are huge numbers if you are measuring intervals on the order of 100ms.

Then there is the question of what you consciously perceive vs. what you subconsciously perceive. It would surprise me if you couldn't measure a difference even in cases where the subjects didn't notice anything themselves.

Finally, we can definitely reject the idea that latencies below 100 ms never matter: there is an obvious difference between a 10 fps animation and a 60 fps one.


> If your database and computations are requiring multiple seconds on a normal web page, you have serious user experience problems.

This is almost always the case when I think "this website is slow", though. When HN is slow, it's not because of some added network latency, but because something is making HN take 3 seconds to serve up my "threads" page, or 2 seconds to successfully post my comment. Same with "reddit is slow", or "this Wordpress blog is taking forever to load" or "Twitter spins the 'working' icon for 2 seconds when I click on that 'retweets' thing in the sidebar before returning anything". Those things are really common in my experience, and those rather than network round-trip times are by far the biggest and most annoying slowdowns, at least in my browsing.


Yes, of course, because network latency is probably almost the same to almost all sites to you. So the vast majority of pages take 400ms or whatever to load, and you don't think anything of it. You only notice when you hit a site that's out of the ordinary — whether it's Google loading in 85ms by using IW9, or Twitter taking multiple seconds to query the snail farm or frozen molasses or whatever it uses for its database now that it's dumped Cassandra.

If IW9 or IW10 gets widely adopted, you'll find yourself thinking "this website is slow" when you visit the rare backwater that doesn't use it.


Clearly you have never dealt with a large database.


I agree fully. I focused on the front-end because squeezing milliseconds out of the backend is my day job, and I'm pretty confident I can generate pages in < 50ms. Given that, I thought it would be interesting to see just how much I could squeeze out of the delivery time.


There was a large discussion earlier about the subject. I posted detailed comments in that thread so I won't repost but just link. http://news.ycombinator.com/item?id=1143317


Can anyone else repeat his experiment?

I tried repeating the experiment. I'm in Sweden, so, annoyingly, a request to google.com redirects to google.se. If I send my request directly to google.se, I get 9k response in 130ms and the initial window looks like 4 to me, i.e. I can't see anything unexpected happening.

I then tried repeating on Amazon EC2. I can't see anything unexpected there either, but the RTT from EC2 to google is only about 3ms, which means I can't assume that the ACKS don't get there.

(The original article author looks at how long the initial 3-way handshake takes and then assumes that all packets take that long, or, probably, half as long, i.e. he assumes that ACKS sent up to one RTT before a packet from google can't have arrived at google in time to affect that packet)

Can anyone else reproduce the experiment?

Other ideas: repeat from Sweden, but send a cookie so that I really get google.com. Repeat from EC2, but make sure I never send any ACKs after the three-way handshake. I'm not curious enough to do the latter, it's a fair bit of work.


It seems Linux does not have option to skip slow start and just use receiver's advertised window. Does anybody know where in net/ipv4/tcp.c this should be set?


Note that we're not talking about skipping slow start completely; we're talking about changing slow start parameters.


Does anyone know what you would do to easily tune this for FreeBSD?


My guess is setting the sysctl "net.inet.tcp.slowstart_flightsize" from the default value of "1" to something else.


Thx. Seems to work! Looks like the defaults are:

    net.inet.tcp.local_slowstart_flightsize: 4
    net.inet.tcp.slowstart_flightsize: 1

Also useful: http://spatula.net/blog/2007/04/freebsd-network-performance-...


Any idea if that would work on Linux?


No idea. Type "sysctl -a | grep slow" and see what knobs the kernel offers.


I believe the current limits for slow-start are not adapted to the current Internet anymore.

According to my own observations, the first 30KB of my pages seem to be transferred faster than the next 30KB. It is not until much more is sent that the average throughput eventually gets back up to what it was during the first 30KB.

This is definitely weird.

Note: I am using Ubuntu on EC2 hosted VMs.

As a result, as much as I can, I try to keep the size of my content below 30KB, using multiple concurrent HTTP requests.

I believe this is related to "slow-start" being pessimistic.

Unfortunately, "slow-start" is not configurable on Linux and I don't feel confident enough to go with some kernel level patch...

Any clue?


You can't use custom kernels on Amazon EC2 anyway, so kernel patches aren't really an option (unless you had some kind of kernel module you can load that would change the value in memory, which seems dangerous).



"It is better to ask for forgiveness than permission" - Rear Admiral Grace Hopper


We have been measuring Google's "reachability" performance and it is quite amazing. The result of their tuning is that they can deliver their initial HTML in under ~250 milliseconds, and in many locations under 100 ms. The other thing the data shows is that the standard deviation on the download times is very small, making the site consistently load fast.

http://www.yottaa.com/url/4be004065df8ca5a730001fb/reachabil...


"They actually managed to deliver the whole response in just 70ms, 30ms of which was spent generating the response"

Isn't part of that just the network latency? Based on the timestamps for the SYN and SYN-ACK it looks like a RTT of about 16ms.

EDIT: Nevermind.

Request was sent by the client at 00.017437

Request ACK was received by the client at 00.037139

RTT of about 20ms, so the request was received by the server around 00.027

First packet of the response was received by the client at 00.067151

67-27=40. Assuming a latency of 10ms it took 30ms to generate the request.
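That arithmetic can be checked directly (timestamps as given above; the 10ms one-way latency is the commenter's assumption):

```python
request_sent   = 0.017437   # client sends the request
ack_received   = 0.037139   # client sees the server's ACK of the request
first_response = 0.067151   # first packet of the response arrives

rtt = ack_received - request_sent         # ~20ms round trip
request_arrived = request_sent + rtt / 2  # ~0.027: server has the request
one_way = 0.010                           # assumed one-way latency
server_time = first_response - request_arrived - one_way

print(f"RTT ~{rtt * 1000:.0f}ms, server time ~{server_time * 1000:.0f}ms")
```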


One should also note that when IE is talking to IIS, the request will be sent in the first packet and the initial response will be sent in the first ACK. You can actually complete a request and response (if small enough) in 3 packets. Also, when tearing down the connection, it's left half-open.

http://osdir.com/ml/mozilla.devel.netlib/2003-01/msg00018.ht...


Mike Belshe - An Argument For Changing TCP Slow Start (Jan 11, 2010):

http://sites.google.com/a/chromium.org/dev/spdy/An_Argument_...


On Ubuntu 8.04 (at least), you can set this per route via something like:

ip route change default via x.x.x.x dev eth0 initcwnd 6

but please test thoroughly if trying this.


Do app engine apps also serve like this?


I don't think so. I'm pulling a 27k URL over a 100ms latency and I'm seeing roughly 2, 4, 8, 8... for the send bursts.


Very interesting. Is such a thing configurable in Apache or nginx? It seems to be a rather rude behavior, but I'm curious how accessible it is.


I don't think any web server gets to this level of control on the tcp/ip level, this is something that should be addressed in the OS' network stack.


Unless you have an exokernel operating system.


Nah, that's not even necessary, I believe: the congestion window is a TCP feature, but any normal user-space program (with a few privileges) can create IP datagrams with custom content, so you would only need to implement a bit of TCP in user space on a normal old-style monolithic kernel.

And since you control the only higher level protocol (your http implementation) you can probably cut some corners and get done quickly :)


You may be able to create them, but as a server, how would you receive incoming connections?


No need for an exokernel here. A microkernel could enable the same thing, since your network stack would be Yet Another Service (TM). This could easily be implemented on the likes of L4.


I don't really see the difference between patching the kernel and patching the privileged network stack server in a microkernel environment.


In an exokernel environment, the network stack server wouldn't be privileged.


It's not tunable at the app level. On linux it requires a kernel patch to change. I'm not sure about other OS's.


I wish there was a patch where you could enable it on linux via a ioctl/setsockopt. Would be very useful to those of us who aren't google.

Edit:

As I wished: Google already sent a patch for this back in May:

http://www.amailbox.org/mailarchive/linux-netdev/2010/5/26/6...

Nice!


something like: ip route change default via 0.0.0.0 dev eth0 initcwnd 10


Perhaps "ip route change default via <your gateway address> dev eth0 initcwnd 10" -- your gateway address, not 0.0.0.0.


That doesn't seem to work for me:

    RTNETLINK answers: No such file or directory
Apparently I need to patch my kernel?


slight offtopic: that's one of my favorite error messages that is actually relatively common.


very interesting post.


Interesting. But it really annoys me when people use "begs the question" incorrectly. Look it up!


Interesting, but there are so many more important things to consider before worrying about load time (i.e. 0 users experiencing 30 ms is far worse..).


Half agree: this level of optimization is less useful when you're not starting from Google's performance baseline. That said, I can't agree with the point as applied to load times generally: optimizing them made a difference even at BCC scales back in 2008ish. Implementing half of the YSlow recommendations takes under an hour in modern web frameworks.



