

Using Flat Files So Elections Don’t Break Your Server - tysone
http://open.blogs.nytimes.com/2010/12/20/using-flat-files-so-elections-dont-break-your-server/?src=tptw

======
jbyers
It's nice to see detail like this on high-traffic, high-risk environments.

I'm curious about their provisions for cross-datacenter failover. The article
mentions haproxy being ready to direct requests to a different datacenter as
well as ELB spanning availability zones. I'd expect a failover option entirely
outside AWS as well, with short-TTL DNS ready to make the switch.

I'm also not sure what value varnish brings to the table when the Apaches are
just serving a small number of flat files. It seems like unnecessary
complexity -- by the same logic found earlier in the article -- when a well-
tuned webserver will serve flat files at a comparable rate. Maybe the Apache
configuration for this workload was sufficiently different to make it an
unwanted risk.

~~~
jemfinch
Doesn't Varnish handle many times more concurrent clients than apache does, at
significantly lower system load? It could just be pure optimization.

~~~
jacquesm
It does. Apache is strictly one process per connection; varnish handles
hundreds, thousands or even tens of thousands of connections with just one
process, so the overhead is minimal.

~~~
jbyers
Apache has not strictly been one process per connection since 2.0. See the
worker MPM. I would have chosen nginx, but Apache can be configured as a
capable static file server.
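For reference, a worker MPM tuning block looks roughly like this. The numbers are purely illustrative, not anything the Times published, and the directive names are the Apache 2.2-era ones:

```apache
<IfModule mpm_worker_module>
    StartServers          4
    ServerLimit          16
    ThreadsPerChild      64
    # MaxClients must not exceed ServerLimit x ThreadsPerChild (16 x 64 = 1024)
    MaxClients         1024
    MinSpareThreads      64
    MaxSpareThreads     192
    MaxRequestsPerChild   0
</IfModule>
```

Each child process serves ThreadsPerChild connections concurrently, which is what makes the "one process per connection" description out of date for this MPM.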

~~~
jacquesm
Even then each thread is still one connection (according to the apache docs).

~~~
pquerna
Except in the Event MPM.

------
jasonkester
Ugh. They're not talking about Flat Files. They're talking about HTML. Saying
so in the headline would have saved a lot of confusion.

A flat file is essentially a .csv holding data, and can be a fast way to bulk
load denormalized data. As such, it actually does have a use in the context of
scaling, so it's natural to expect that they were using the term correctly.

A _static_ file, or more simply, an _HTML_ file, is what they're actually
talking about. As they've noticed, it's what web servers are best at serving,
and it scales obnoxiously well.

Now that we're all talking about the same thing, I can say that I've been
serving all my product blogs as .html for years, and have never had any of
them fall down under load.

It's really easy to get something in place to generate static files from a
blog or CMS. I do it with:

- a 404 handler that maps missed requests for .html to their equivalent
generator.

- a regular old blog engine that takes an extra parameter
"writeThisToHTMLOnceYouveRenderedIt"

Future visitors skip the redirecting and generating and are simply served the
static file. Next time you edit the content, you can simply blow away the
whatever-blog-entry.html and index.html and know that they'll show up again
next time anybody asks for one of them.
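A minimal sketch of that write-on-first-miss pattern, in Python. The function and directory names are invented for illustration; the real thing would live inside whatever blog engine and 404 handler you already have:

```python
import os

CACHE_DIR = "static"  # where rendered .html files land (name is illustrative)

def render_post(slug):
    # Stand-in for the real blog engine's renderer.
    return "<html><body><h1>%s</h1></body></html>" % slug

def handle_miss(slug):
    """The 404 handler: render the page, write it to disk, return the HTML.

    The next request for /<slug>.html is served straight off disk by the
    webserver and never reaches this code.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    html = render_post(slug)
    path = os.path.join(CACHE_DIR, slug + ".html")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:   # write-then-rename so a concurrent reader
        f.write(html)           # never sees a half-written file
    os.replace(tmp, path)
    return html

def invalidate(slug):
    # "Blow away" the cached page; it regenerates on the next 404.
    try:
        os.remove(os.path.join(CACHE_DIR, slug + ".html"))
    except FileNotFoundError:
        pass
```

Editing a post just calls invalidate(), and the page shows up again the next time anybody 404s on it.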

~~~
masklinn
> A static file, or more simply, an HTML file, is what they're actually
> talking about.

Static yes, HTML no: an HTML file could still have e.g. SSI instructions. It's
valid HTML, but if the webserver supports SSI, it's not going to be static.

~~~
idive
Well, now we're arguing semantics, but no... an SSI instruction is not part of
HTML. It's a server-side scripting language. I could set up my server to
interpret PHP embedded in .html files as well, but then they're not really
HTML files any more.

------
snprbob86
Slightly off-topic, but the title of this blog (column?) is amazingly clever:

"All the Code That's Fit to printf()"

~~~
PostOnce
Your threshold for _amazing_ is pretty low. :P

~~~
patio11
Would your impression of the pun be improved if I said that the NYT's motto,
on every masthead for over a century now, has been "All The News That's Fit To
Print"?

~~~
PostOnce
I understood the joke perfectly, it's just not that witty...

------
skorgu
I approve of this level of paranoia: it's never failed, not even once, but
_just in case_...

------
jrockway
I have always been a proponent of the 5 second cache for things like blog
posts. After you hit Reddit, you can see 1000 requests a second for a few
minutes. If you cache for 5 seconds, nothing is ever very far out of date, but
you save yourself 5000 requests. Seems like a win-win.

(I'm also a big fan of Varnish. I tried it out this weekend on a site that
basically serves an HTML file that says "hello world". Apache can do about
11,000 requests a second, but Varnish can serve 15,000 requests a second.
Excellent!)
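The 5-second cache idea is simple enough to sketch; here's a toy version in Python (not what jrockway actually ran, and with the clock injectable just to make it testable):

```python
import time

class MicroCache:
    """Cache one rendered page for a few seconds.

    Under a 1,000 req/s spike, a 5-second TTL means the page is rendered
    once per ~5,000 requests instead of once per request, while never
    being more than 5 seconds out of date.
    """

    def __init__(self, render, ttl=5.0, clock=time.monotonic):
        self.render = render   # the expensive page-rendering function
        self.ttl = ttl
        self.clock = clock     # injectable so tests can fake time
        self._value = None
        self._expires = float("-inf")

    def get(self):
        now = self.clock()
        if now >= self._expires:              # stale, or never rendered
            self._value = self.render()
            self._expires = now + self.ttl
        return self._value
```

A real deployment would more likely put this in Varnish or memcached than in-process, but the arithmetic is the same.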

~~~
swombat
You get 1000 requests a second from Reddit? Wow, it's a whole lot bigger than
I thought.

~~~
jrockway
Now that I am remembering better, I actually hit the front page of Reddit and
del.icio.us at the same time. Those both drive a surprising amount of traffic.
(The article was "Git merging by example", now archived here:
<http://blog.jrock.us/posts/Git%20merging%20by%20example.pod>)

I did not use varnish at the time, but I basically kept up with the load. My
pages were cached with an in-memory cache in the app layer, which worked well
enough, I guess. As blog.jrock.us mentions, I was unhappy with the design of
my software, so I took it down. Two years later, I almost know what a good
design is, and so I should have a blog again soon. But I digress :)

------
superjared
This is exactly why I created StaticGenerator for Django:
<https://github.com/luckythetourist/staticgenerator>

~~~
mikeytown2
On the drupal side there is Boost: <http://drupal.org/project/boost> It has
some fairly smart cache invalidation logic as well. At work we use the cache
invalidation logic to tell varnish what needs to be refreshed. We use this
setup for over 1,000 TV station news sites; supported by 10 boxes total. Flat
file cache size is around 80GB. Our load is usually under 2 on all boxes.
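One common way for an app to tell Varnish what needs refreshing is an HTTP PURGE request. A rough sketch (host and port are placeholders, and Varnish only honors PURGE if its VCL is written to accept it):

```python
import http.client

def purge(path, host="127.0.0.1", port=6081):
    """Ask Varnish to drop its cached copy of `path`.

    Varnish ignores the PURGE method unless its VCL is configured to
    handle it, typically restricted to trusted client IPs.
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("PURGE", path)
        return conn.getresponse().status  # 200 when the VCL accepts it
    finally:
        conn.close()
```

The Drupal/Boost invalidation logic described above would sit in front of this, deciding which paths to purge when content changes.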

------
beej71
Forgive my ignorance, but does "flat file" mean "prerendered static HTML page"
in this usage?

~~~
brown9-2
Sounds very much like it. From the article:

>After each batch of new data was received from the AP, this server determined
which pages needed to be re-rendered and, using the Typhoeus libcurl-multi
bindings for Ruby, pulled new data for each of these pages from the render
pool.

Sounds like their infrastructure already has a farm of servers set up to
handle rendering articles into their HTML components, so for this scenario
they would save the output as a static file and then write it to disk rather
than sending the HTML out to the response / caching layer / etc.
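A rough sketch of that pull-and-write step, using Python's stdlib thread pool in place of the Typhoeus/libcurl-multi bindings the article names (the URLs, paths, and `fetch` hook here are all invented for illustration):

```python
import concurrent.futures
import os
import urllib.request

def pull_and_write(urls_to_paths, workers=8, fetch=None):
    """Re-pull freshly rendered pages in parallel; write each to disk.

    `urls_to_paths` maps a render-pool URL to the static file path the
    webserver serves. `fetch` is injectable so tests can fake the network.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=10).read()

    def pull_one(url, path):
        body = fetch(url)
        if isinstance(body, str):
            body = body.encode()
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:   # write-then-rename: readers never see
            f.write(body)            # a partially written page
        os.replace(tmp, path)

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(pull_one, u, p) for u, p in urls_to_paths.items()]
        for fut in futures:
            fut.result()  # re-raise any fetch/write error
```

The atomic rename matters here: Apache may be serving the old copy of a results page at the exact moment the new one lands.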

~~~
beej71
All righty... it was just a usage of the term "flat file" I was unfamiliar
with. :-)

~~~
julian37
Yeah, I think they made that up. I've only ever encountered the term "flat
file" in the sense of a flat file database, such as a CSV file. As you say,
pre-rendered static page is a much better name for it.

~~~
sambeau
I have always used the term "static html" (as opposed to "dynamically
generated html").

"Flattened files" may have been what they meant. "Flattened files" would be a
nice, simple term for them if there weren't confusion with flat-file
databases.

~~~
beej71
Ah! Now it makes more sense, etymology-wise and all that.

------
cagenut
the filesystem is just another datastore. using it like this means you're
spreading the requests per second across each individual server's available
IO, but you've also forsaken the "getting data from point a to point b"
features of other datastores and therefore have to do it yourself (usually
rsync).

to be honest, since varnish was already set up both as an HA environment and
to use the grace/saint features, there wasn't actually a problem left to
solve; this is a case of overengineering in my book.

~~~
sambeau
The file-system isn't just another data-store: it's I/O and therefore special.
On systems with 'sendfile' data can be sent straight from disk to network port
without CPU time (and any copies being made). This is used extensively by
high-performance webservers and web-accelerators.

~~~
dedward
sendfile() still requires CPU time - it just passes the job of sending the
data over the socket to the kernel to finish up. The previous method would
have been a select() loop or similar that ensured the data was sent out to the
kernel, involving a bunch of system calls and context switches. sendfile lets
you hand the job off 100% to the kernel and have your thread move on.
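Python exposes the same syscall as os.sendfile, so the hand-off is easy to see. A minimal sketch (Linux-specific; on other platforms the call's behavior and supported descriptor types differ):

```python
import os

def send_whole_file(sock, path):
    """Serve a static file over a connected socket via sendfile(2) on Linux.

    The kernel copies file data to the socket directly, so the bytes never
    pass through user space; the calling thread just loops until the whole
    file has been handed off (sendfile may send fewer bytes than asked).
    """
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:      # peer closed the connection early
                break
            offset += sent
        return offset
```

This is the primitive that static-file servers and accelerators build on; the loop is needed because a single sendfile call is allowed to transfer only part of the requested range.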

~~~
sambeau
I was under the (possibly mistaken) impression that Direct Memory Access (DMA)
hardware could move data straight from the Disk Cache to the Network card
without touching the CPU Caches at all.

~~~
dedward
True - DMA can do that - but then you'd have to have something managing the
network stack - the kernel manages tcp/ip.

From the man page:

"sendfile() copies data between one file descriptor and another. Because this
copying is done within the kernel, sendfile() is more efficient than the
combination of read(2) and write(2), which would require transferring data to
and from user space."

------
ajays
I may sound naive here, but: if all you're doing is serving just 184 flat
files, then why do you need all this RoR jazz? Can't a bunch of Apache/Nginx
servers behind a load balancer (Varnish, TrafficServer, etc.) handle that just
fine? Throw in a DNS-based scheme for failover/redundancy?

~~~
necubi
Ruby on Rails is generating the flat files, while apache et al. are serving
them.

------
dedward
I'm really curious now how many other large setups use a similar stack
(memcached, varnish, apache, rails) for their sites... anyone know a good
site that's put together any statistics?

