
Bandwidth needs halved by new compression written in Go - marmalade
http://arstechnica.com/information-technology/2013/02/cloudflare-blows-hole-in-laws-of-web-physics-with-go-and-railgun/
======
rjknight
The title suggests that there's something unique about Go, either the language
or its standard library, that enables bandwidth savings. In fact, Cloudflare
have written some software which they claim enables them to reduce their
bandwidth, and this software happens to be written in Go. This might be an
excellent choice (and I suspect it probably is), but it's not _Go per se_ that
is reducing the bandwidth usage.

~~~
jgrahamc
I agree. The benefit of using Go is that it's fast to write and has good
concurrency features. To give you an idea of the size, there are 7,329 lines
of Go code in Railgun (including comments) and a 6,602 line test suite.

In the process we've committed various things back to Go itself and at some
point I'll write a blog post on the whole experience, but one thing that made a big
difference was to write a memory recycler so that for commonly created things
(in our case []byte buffers) we don't force the garbage collector to keep
reaping memory that we then go back and ask for.
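
For what it's worth, the recycler pattern described here can be sketched in a few lines of Go. This is not CloudFlare's code, just a minimal channel-based free list for []byte buffers:

```go
package main

import "fmt"

// Recycler hands out []byte buffers and takes them back, so hot paths
// reuse memory instead of forcing the garbage collector to keep
// reaping and reallocating. A sketch of the general technique.
type Recycler struct {
	pool chan []byte
	size int
}

func NewRecycler(capacity, bufSize int) *Recycler {
	return &Recycler{pool: make(chan []byte, capacity), size: bufSize}
}

// Get returns a recycled buffer if one is available, else allocates.
func (r *Recycler) Get() []byte {
	select {
	case buf := <-r.pool:
		return buf[:0] // reuse the backing array, reset the length
	default:
		return make([]byte, 0, r.size)
	}
}

// Put returns a buffer to the pool; drops it if the pool is full.
func (r *Recycler) Put(buf []byte) {
	select {
	case r.pool <- buf:
	default: // pool full, let the GC have this one
	}
}

func main() {
	r := NewRecycler(16, 4096)
	b := r.Get()
	b = append(b, "hello"...)
	r.Put(b)
	b2 := r.Get()
	fmt.Println("recycled cap:", cap(b2)) // 4096: the original backing array came back
}
```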

The concurrency through communication is trivial to work with once you get the
hang of it and being able to write completely sequential code means that it's
easy to grok what your own program is doing.
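
As a tiny illustration of that style (a hypothetical fetch function, not Railgun code): each goroutine's body reads top to bottom like ordinary sequential code, and a channel does the synchronization.

```go
package main

import "fmt"

// fetch simulates a blocking network call; the body reads
// top-to-bottom like plain sequential code.
func fetch(url string, results chan<- string) {
	// ... do the request, blocking as needed ...
	results <- url + ": ok"
}

func main() {
	urls := []string{"/a", "/b", "/c"}
	results := make(chan string, len(urls))
	for _, u := range urls {
		go fetch(u, results) // concurrency via goroutines
	}
	for range urls {
		fmt.Println(<-results) // communication does the synchronization
	}
}
```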

We've hit some deficiencies in the standard library (around HTTP handling) but
it's been fairly smooth. And, as the article says, we swapped native crypto
for OpenSSL for speed.

The Go tool chain is very nice. Stuff like go fmt, go tool pprof, go vet, go
build make working with it smooth.

PS We're hiring.

~~~
gatherknwldg
"write a memory recycler"

 _sigh_ This is by far Go's biggest wart IMO, and one that frequently sends me
back to a pauseless (hah! at least less pausy:) systems language. I sure do
like it in almost every other meaningful regard. But I wish latency wasn't
something the designers punted on.

~~~
saidajigumi
I occasionally hear this kind of complaint, but I've yet to see any silver-
bullet memory management system. AFAICT, the best we've been able to
accomplish is to provide an easier path to correctness with decent overall
performance. Also, GC latency isn't the only concern. As soon as the magic
incantation "high performance" is uttered, all bets are off.

There's been decades of work on real-time garbage collection yet all of those
approaches still have tradeoffs. Consider that object recycling is a
ubiquitous iOS memory management pattern. This reduces both memory allocation
latencies and object recreation overhead. Ever flick-scroll a long list view
on an iPhone? Those list elements that fly off the top are virtually
immediately recycled back to the bottom -- it's like a carousel with only
about as many items as you can see on screen. The view objects are continually
reused, just with new backing data. This approach to performance is more
holistic than simply pushing responsibility onto the memory allocator.

Memory recycling here also reminds me of frame-based memory allocator
techniques written up in the old Graphics Gems books, a technique likewise
covered in real-time systems resources. Allocating memory from the operating
system can be relatively expensive and inefficient, even using good ol'
malloc. A frame-based allocator grabs a baseline number of pages and provides
allocation for one or more common memory object sizes (aka "frames"). Pools
for a given frame size are kept separate, which prevents memory fragmentation.
Allocation performance is _much_ faster than straight malloc, while increasing
memory efficiency for small object allocation and eliminating fragmentation.
Again, this is a problem-specific approach that considers needs beyond
latency.
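
A minimal Go sketch of such a frame-based allocator (my own toy version, not from Graphics Gems): carve one upfront slab into equal-sized frames and push/pop them from a free list.

```go
package main

import "fmt"

// FramePool carves one big slab into fixed-size frames and hands them
// out from a free list: no per-allocation syscalls, and no fragmentation
// within the pool because every frame is the same size.
type FramePool struct {
	frameSize int
	free      [][]byte
}

func NewFramePool(frameSize, nFrames int) *FramePool {
	slab := make([]byte, frameSize*nFrames) // one upfront allocation
	p := &FramePool{frameSize: frameSize}
	for i := 0; i < nFrames; i++ {
		p.free = append(p.free, slab[i*frameSize:(i+1)*frameSize])
	}
	return p
}

// Alloc pops a frame off the free list: O(1), no malloc.
func (p *FramePool) Alloc() []byte {
	if len(p.free) == 0 {
		return nil // exhausted; a real allocator would grab another slab
	}
	f := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return f
}

// Free pushes a frame back for reuse: O(1).
func (p *FramePool) Free(f []byte) {
	p.free = append(p.free, f)
}

func main() {
	pool := NewFramePool(256, 4)
	f := pool.Alloc()
	fmt.Println("frame size:", len(f))
	pool.Free(f)
}
```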

~~~
haberman
I can't speak for the grandparent, but for my part I agree with your point
that allocation patterns matter and that there is no silver bullet to memory
management, which is exactly the reason that GC'd languages like Go are
uninteresting as systems languages. Why use a language where you have to work
around one of its main features when you care about performance?

I find Rust's approach much more interesting, because GC is entirely optional,
but it provides abstractions that make it easier to write clear and correct
manual memory management schemes.

~~~
saidajigumi
Hm, Rust keeps hitting my radar with interesting attributes like this. Time to
go have a look see. Thanks!

------
calinet6
Go is used, sure, but the cool part about this is the binary Railgun protocol.
Really smart. Send only file hashes and binary diffs back and forth, do a
little extra computation to figure out the changes, but only send the absolute
minimum data you need to the CDN. That's just smart, and frankly, I hope other
CDNs have been doing this already, because at any high volume it seems to be
an obvious solution.

So that brings up the question—is this just something CloudFlare is announcing
for the PR, or is it actually innovative?

~~~
0x0
It sounds like they reinvented rsync to me?

~~~
jgrahamc
No, because we have more information than rsync does. We own both ends of the
connection and can keep versions synchronized.

~~~
0x0
That sounds interesting, could you elaborate on how it is different from rsync
though? "Keep versions synchronized" is a bit vague

~~~
jgrahamc
The piece in the CloudFlare network and the piece in the customer network are
able to keep track of which page versions they each have and so the part in
the CloudFlare network sends a request saying "Please do GET /foo and compress
it against version X". That means that at request time there's no back-and-
forth between the components deciding what compression dictionary to use.

~~~
0x0
A bit like rsync's --fuzzy or --compare-dest then?

~~~
jgrahamc
Well, fuzzy tries to find something to use as a 'destination' file so it can
send across some hashes. Railgun has more complete information because it is
keeping synchronized and thus the part making a request can specify the
dictionary to compress with in a single hash.

~~~
0x0
Thanks for the explanation, that does sound useful! :)

------
bitcartel
The bandwidth reduction is due to use of a binary protocol, not Go. It just so
happens the server code is written in Go and C.

From the article:

“Go is very light,” he said, “and it has fundamental support for concurrent
programming. And it’s surprisingly stable for a young language. The experience
has been extremely good—there have been no problems with deadlocks or pointer
exceptions.” But the code hit a bit of a performance bottleneck under
CloudFlare’s heavy loads, particularly because of its cryptographic
modules—all of Railgun’s traffic is encrypted from end to end. “We swapped
some things out into C just from a performance perspective," Graham-Cumming
said.

“We want Go to be as fast as C for these things,” he explained, and in the
long term he believes Go’s cryptographic modules will mature and get better.
But in the meantime, “we swapped out Go’s native crypto for OpenSSL,” he said,
using assembly language versions of the C libraries.

~~~
shanelja
On another note, it's always nice to see such an influential part of the HN
community giving quotes for sites like this - not only does it make me a
little proud to be associated with any of you, it makes me more hopeful that
one day I'll be able to call myself one of _us_.

~~~
lclarkmichalek
Success by association seems about as valid as guilt by association.

------
sigil
Question for jgrahamc: how much more efficient is your binary delta algorithm
than cperciva's bsdiff [1]?

I assume since you've got the preimages of compression, as well as control
over the compression format, that the diff and patch operations are much more
efficient in space and time than they would be with arbitrary binary data.
But...by how much?

[1] <http://www.daemonology.net/bsdiff/>

~~~
jakubw
bsdiff is not a general purpose binary delta algorithm, it's targeted at
executables. When you change a single line in the source code of a program and
recompile it, bsdiff produces a small diff, even though a normal binary diff
between the old and new executable would be huge due to how even a single
extra instruction can cause many more addresses to shift. bsdiff wouldn't be
particularly useful here.

~~~
sigil
This is true. Re-reading the bsdiff paper, it's pretty tailored to executable
file formats.

<http://www.daemonology.net/papers/bsdiff.pdf>

~~~
cperciva
It works fine on non-executables too. Executables are the hard case, that's
all.

------
silvertonia
Could be very cool. I couldn't get through the article because it read like a
press release. Maybe if someone who hasn't been spoon-fed the story reports on
it, I'll take notice.

~~~
peterwwillis
I don't know why you're being downvoted, the article is written pretty
shittily. The article is mostly just quotes from jgc and the CEO and some
filler by the writer.

Also the assertion that "It has already cut the bandwidth used by 4Chan and
Imgur by half" sounds disingenuous and possibly not backed up by moot's quote
_“We've seen a ~50% reduction in backend transfer for our HTML pages (transfer
between our servers and CloudFlare's),”_. Is backend transfer for HTML pages
the only bandwidth they're using? Is the rest of it halved, and if so, how and
why?

The title of the story also makes me gag.

------
justinsb
I think this is just RFC 3229, with a binary protocol (?)
<http://www.ietf.org/rfc/rfc3229.txt>

I've always thought there were some potential attacks there around cache
disclosure (which Google avoided by going with SDCH instead).

CloudFlare controls the server and the client, so they don't need to worry
about the attacks or about persuading everyone to adopt their RFC.

------
glymor
How large is the per site cache? Are cookies part of the hash (and if so how
do you strip meaningless cookies)?

Otherwise this is more compelling for content sites like the referenced
4chan. But still very cool.

~~~
jgrahamc
There isn't a per-site cache in Railgun because it's part of our large shared
in-memory cache in our infrastructure.

Currently, cookies are not part of the hash.

We have customers of all types using Railgun. As an example, there's a British
luggage manufacturer who launched a US e-commerce site last month. They are
using it to help alleviate the cross-Atlantic latency. At the same time they
see high compression levels as the site boilerplate does not change from
person to person viewing the site.

What sort of sites do you think it doesn't apply to?

~~~
justinsb
Surely there is a per-site cache on the origin server (in what you call the
"Listener")?

~~~
jgrahamc
Yes. That's up to the particular configuration of the site. It varies from
site to site, but for optimal results you want it big enough to keep the
content of the common pages of your site.

------
songgao
I'm curious about the crypto part. Could anybody explain to me, if it's a
HTTPS link, where does SSL encryption happen? Does Railgun listener talk with
the origin server over HTTP or HTTPS?

If it's HTTP, then how does CDN handle certificates? Does it use CDN's
certificates?

If it's HTTPS, then 1) Isn't the hash going to be a lot different even if the
two versions are very alike? 2) Why does Railgun encrypt the encrypted data again?

~~~
jgrahamc
The link between CloudFlare and the customer network (i.e. between the two
bits of Railgun) is TLS. We have an automated way of provisioning and
distributing the certificates necessary for that part.

For the connection from Railgun to the origin server it will depend on the
protocol of the actual request being handled. If HTTPS Railgun makes an HTTPS
connection to the origin.

~~~
songgao
Thanks! That makes sense now :-)

------
xanadohnt
The change detection algorithm is clever. But this is a classic memory vs.
processor tradeoff. The real trick here is that the Railgun service instantly
adds massive amounts of cache to your service; it just so happens that, if
their claims aren't inflated, adding these additional resources is
transparent. This has nothing to do with Railgun being written in Go.

------
tuxidomasx
Other than general traffic data compression, I've always been somewhat
interested in html compression in particular.

I know lots of webservers zip their response data, but I was always curious
about the things in html that show up very often and if there's a way to
optimize around that.

For example, most web xml data contains a lot of common tags, like "div" and
"span" and others that are specific to html. I think if you add them up, they
might make up a considerable percent of traffic data. Is it possible for the
web server to swap those out for a single character before it sends the data,
and have the browser replace it when it arrives?

Or does zip compression already do that somehow?

~~~
wisty
Yeees, no.

Zip will store each common tag (like "div") once in the compression
dictionary, then emit a short code every time it appears (more or less - it
might take less than a single byte if it's a _really_ common tag). So the
overhead is that each response carries its own dictionary of common tags,
which is wasted bytes.

It would be more efficient if browsers and compression algorithms could agree
(beforehand) on a dictionary of common terms likely to appear in the
document.

If you're compressing a lot of data which is likely to be similar, you can do
this with a common dictionary. See -
[http://stackoverflow.com/questions/479218/how-to-compress-
sm...](http://stackoverflow.com/questions/479218/how-to-compress-small-
strings)

Of course, my answer on Stackoverflow is pretty crude. You could create a
dictionary used to compress the compression dictionary. Google will probably
do this soon (if they haven't already) since they control the client (Chrome),
server (google web server) and protocol (SPDY).

------
shotgun
I see that the article is tagged "open source." Is CloudFlare going to open
source Railgun? Publish any papers?

This isn't an announcement about companies supporting Railgun...it's about
companies supporting CloudFlare by installing the Railgun Listener.

------
cobrabyte
This is the third time this week that I've read or heard about Communicating
Sequential Processes (CSP), the formal programming language devised by Sir
Tony Hoare.

Third time's a charm. Definitely going to have to investigate.

------
coolj
> Today, [cloud providers Amazon Web Services and Rackspace, and thirty of the
> world’s biggest Web hosting companies] announced that they will support
> Railgun...

I can't find any such announcements; anybody have links? Based on comments
further down, I wonder if the author is confused.

> CloudFlare will provide software images for Amazon and RackSpace customers
> to install

That is very different from the claim in the first paragraph.

~~~
eastdakota
Amazon and Rackspace customers need to install the software themselves (for
now). The other listed hosts have made it one-click simple without the
customer having to install anything. A couple announcements from major hosts
today:

Dreamhost: [http://dreamhost.com/dreamscape/2013/02/26/cloudflare-
railgu...](http://dreamhost.com/dreamscape/2013/02/26/cloudflare-railgun/)
Media Temple: [http://weblog.mediatemple.net/2013/02/26/the-web-just-got-
fa...](http://weblog.mediatemple.net/2013/02/26/the-web-just-got-faster-with-
railgun/)

------
pjmlp
Another Go PR story.

Same thing could be easily achieved using futures or any of the asynchronous
libraries available to C++, Ada, JVM and .NET languages.

------
DoubleCluster
This is WAN optimization, right? This is already being done but usually for
(VPN) connections to other branches of a company.

~~~
peterwwillis
No. This is basically binary diffing and compression.

Edit: err, you are correct, I didn't realize WAN optimization included binary
diffing and compression. Should google before I comment.

------
zobzu
Uh oh. A binary protocol. The problem being, it actually brings financial
advantages over plain HTTP. HTTP has the advantage of being standard, simple,
plain text and thus easy to work with.

Hopefully HTTP 2.0 will attempt to solve this... erm...

~~~
radd9er
is your concern that the proprietary protocols will take over the web?

~~~
zobzu
not in particular. complex binary protocols, while slightly more efficient,
are much harder to use, understand, and design properly.

------
philiac
The article mentions how this compression technique is similar to image
compression. Would anyone care to explain, in detail if necessary, how this is
so? Thanks.

~~~
radd9er
I think it's because a whole bitmap isn't streamed for every new frame, just
a diff telling the player about the parts of the image that need updating.

------
jamieb
FTA: "If it was written in C++, it would be threaded code"

Uh, why?

~~~
jussij
Because that’s one approach to getting the most out of all of those multiple
core CPU servers.

For Go that came for free because its Communicating Sequential Processes
design does that for you.

~~~
abraininavat
Came for free? Go takes advantage of multiple cores by using threads. CSP
doesn't magically multiplex your code onto your cores.

~~~
jussij
> CSP doesn't magically multiplex your code onto your cores

Take a look at this Rob Pike video:
[http://blog.golang.org/2013/01/concurrency-is-not-
parallelis...](http://blog.golang.org/2013/01/concurrency-is-not-
parallelism.html)

Now that video might well be crap; I'll be the first to admit I'm not skilled
enough to know one way or the other.

But based on that video, it does appear to me that Go does offer some form of
multi-core magic and it does appear to come at a minimal cost.

~~~
abraininavat
It's not magic. It's threads. Go multiplexes your goroutines onto N OS
threads. There are also abstractions in C/C++ (though of course as libs, not
part of the language, like in Go) which hide the usage of threads. But there
is no magic. If your code is running in parallel, your code is using OS
threads.

------
corresation
I was just looking into what SDCH is (an accept-encoding option from Chrome)
and it sounds very, very similar: It generates a dictionary and then uses
VCDIFF between requests. Is this related somehow?

~~~
jgrahamc
Vaguely. Both Railgun and SDCH work by compressing web pages against an
external dictionary. In SDCH the dictionary must be generated (somehow), and
it is intended for use between a web server and browser. Railgun is used on
the back-end of our network and generates dictionaries automatically.

[http://calendar.perfplanet.com/2012/efficiently-
compressing-...](http://calendar.perfplanet.com/2012/efficiently-compressing-
dynamically-generated-web-content/)

~~~
jws
Is anyone aware of a performance analysis between SDCH and one of the dynamic
compressions like deflate?

I google, but all I find is people complaining their
proxy/filter/appliance/diagnostic is breaking because it doesn't understand
SDCH.

It seems like SDCH has been around for 4 years; I presume the lack of data
means it hasn't worked out.

(I imagine that you could drastically reduce the CPU load of compression by
making simple hard coded state machines for each dictionary. For content like
XML or json you could easily make your field names and surrounding punctuation
minimal. For many very short messages sharing a dictionary that would beat
deflate on compression ratio, and for long messages of non-repeating field
values it wouldn't be much worse. CPU use of expansion is probably comparable,
though you might get better memory access behavior out of SDCH.)

~~~
corresation
What you describe is exactly what I've been looking for. There are remarkably
few resources on this.

We have users in Singapore who access various XML-heavy web services in our NY
office. A dictionary-style over-multiple-requests compression technique would
be brilliant for their case.

~~~
packetslave
Take a look at the various WAN accelerator appliances (Cisco, Silverpeak,
Riverbed). They do almost exactly what it sounds like you want (if I'm
remembering back to my evaluations, Cisco at least uses a multi-request
dictionary for their compression)

~~~
emmelaich
I was going to mention Riverbed, glad someone else did.

They've saved us a huge amount (I think around 90%) of AJP (http<->tomcat)
traffic. Not particularly difficult to set up.

~~~
corresation
Riverbed looks ideal but aren't they incredibly expensive? We looked at it
years ago and I believe the necessary endpoints in our data center and in
Singapore pushed past $140,000.

~~~
packetslave
Riverbed (and all of the players in this space, really) are quite expensive,
but this is where you get into the whole "Total ROI" argument for justifying
the purchase.

Most companies depreciate hardware over 3 years. How much WAN/Internet
bandwidth will you NOT use over the next 3 years, and how does that translate
into upgrades you won't need to make?

There are also arguments for these boxes along the lines of "right now we use
really expensive WAN links, but these boxes do end-to-end encryption too, so
we can put the traffic on the Internet instead" but that opens up a few
obvious cans of worms (and can of course be done without an accelerator with
VPNs and whatnot).

Then you get into the more nebulous arguments that big bosses tend to like,
such as "The average user makes Y XML requests per day to process X widgets.
Each request takes Q seconds now. If we lower that to Q*0.5 with WAN
acceleration, each user can now process N more widgets per day". Fluffy
argument, but can have a big impact on business decision makers, especially if
you can tie it to a dollar amount.

Note that WAN Accelerator salespeople are really, really good at coming up
with arguments like this for/with you during the sales process.

