
How to build a news app that never goes down and costs you practically nothing - llambda
http://blog.apps.npr.org/2013/02/14/app-template-redux.html?
======
danso
I had the pleasure of working with NPR's data news chief, Brian Boyer, who
taught me a lot about actual good software practices.

I agree with his preaching the power of flat-files. Not that flat-files should
be used to do things that they inherently can't...but too many projects
(or hobby apps) never consider them and then spend just as much time figuring
out how to keep their server from crashing. I find it pretty amazing that they
have only one small EC2 instance for their news apps (this is separate from
the NPR.org site overall) just to do cron jobs.

Flat files, of course, require good planning...not least of which involves an
accurate gauging of how often an app's data needs to be refreshed. But I like
that kind of planning and thinking more than I do the kind it takes to
maintain a stable server.

~~~
ericcholis
Recently I re-engineered an ecommerce site that takes many things from the
"flat-file" mantra. It's amazing how little database interaction you actually
need for many sites.

I think the database-driven content comes primarily from kitchen sink packages
like Wordpress. More often than not, it's better to get an overall look at
your structure and decide what needs to be dynamic and what can be "static".

When talking about it internally, we refer to "flat-file" as "generated".
Simply meaning that it's dynamically created by a task or user interaction.

~~~
wting
The file system is the original "database". I solved an interview question
recently using a few Linux utilities, pipes, and flat files. It was a bash one
liner that impressed the interviewer, but then he asked me to solve it the
"real" way.

Many databases and file systems use B-trees to implement fast reads / writes;
it's just that SQL databases enforce structure. OTOH, using flat files moves
data checking into application space and loses certain ACID properties.

Similar trade offs are made between NoSQL and SQL, dynamic and static
languages, but I digress...

~~~
ewang1
I would imagine "writes" are still being written to the database, which then
triggers the re-generation of the appropriate flat-file(s).

~~~
wting
No, given a bunch of Apache logs I needed to find the top 10 queries that met
conditions A and B.

1. grep for the conditions and extract the query with sed

2. append a char to a flat file (i.e. increase the query count by 1)

3. sort by largest files and get the top 10

Parsing log files[0]:

    
    
      mkdir -p ./results
      grep -h cond_A ./*.log | grep cond_B | sed -nE "s:.*(query).*:./results/\1.txt:p" | while read -r f; do echo 1 >> "$f"; done
    

Finding top 10 results:

    
    
      ls -Sl ./results/*.txt | head
    

[0]: Untested code. The grep sits on the outside and `echo 1 >>` runs once per
matching line, so each result file grows with every hit for its query.

~~~
mbreese
You probably could have done the same thing in one line:

    
    
        cat *.log | grep condA | grep condB | sed 'some regex to get rid of dates, etc' | sort | uniq -c | sort -nr | head -n 5
    

Or something like that... Not very efficient, but it would work in a pinch. I
actually do something like this all the time for large datasets. In the time
it would take me to write something better, this set of commands is already
done.
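
The same pipeline can be sketched in Python if that reads easier. This is only
a sketch: the `q=` query pattern and the `condA` / `condB` substrings are
placeholders standing in for whatever conditions A and B and the real query
format are.

```python
import re
from collections import Counter

# Hypothetical pattern: pull the value of a "q=" parameter out of a request line.
QUERY_RE = re.compile(r"[?&]q=([^&\s]+)")


def top_queries(lines, cond_a, cond_b, n=10):
    """Count queries on lines containing both substrings; return the n most common."""
    counts = Counter()
    for line in lines:
        if cond_a in line and cond_b in line:
            match = QUERY_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        'GET /search?q=cats HTTP/1.1" 200 condA condB',
        'GET /search?q=dogs HTTP/1.1" 200 condA condB',
        'GET /search?q=cats HTTP/1.1" 200 condA condB',
        'GET /search?q=cats HTTP/1.1" 404 condA',
    ]
    print(top_queries(sample, "condA", "condB"))
```

`Counter.most_common` plays the role of `sort | uniq -c | sort -nr | head`.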

------
willholloway
This is an excellent post. I used almost exactly the same system to build my
app, streamjoy.tv. Every day streamJoy scours Netflix, iTunes and Amazon to
find the availability of movies for streaming, rental and digital purchase.

I built it as a portfolio piece and haven't finished it because I've been
doing consulting jobs, but if you want to watch a particular movie online
without going the pirate route, it's the start of a legal alternative to the
old sites like sidereel.

The tech behind it is the same as this article. Flask and jinja render a
static html page for each of about 92,000 movies.

I use a flask app for the search functionality that accesses an elastic search
database.

I used mongodb because it was incredibly easy to create a local cache of the
JSON data I was getting from the apis I was accessing.

It all took a lot longer than I ever would have thought to even get it to this
point. There were a lot of little annoying issues with the various APIs I had
to access, and the annoyance of parsing XML amongst other schleps I had to
deal with.

I have only ever mentioned it on hacker news one other time and the last time
my elastic search server crashed from the traffic. It is all running on a $5 a
month digital ocean vps.

The flask/jinja static page creation is rock solid and would never fail if I
pushed it to s3. Right now my elastic search server is the bottleneck; I
haven't taken the time to throw hardware at it or set up clustering.

All in all it's a pretty cool service in my opinion. I built it for myself
because I love movies and spend a lot of time watching them online, and I made
a decision to never pirate a content creator's work again. Also, the experience
of netflix, amazon and itunes is orders of magnitude better than the old
megavideo/bittorrent trouble of finding the real deal and not being inflicted
with spammy ads with voiceover.

I really like the flask/jinja/bootstrap/javascript/mongodb/elastic search
stack. I've learned a lot of tips and tricks by building streamJoy and if
people want I would be happy to share them with the community.

I know this sounds like self promotion of my app but I haven't even taken the
time to implement affiliate tracking for any service besides amazon.
Consulting is serious and real money right now and that takes priority over
this little side project I did.

~~~
abraininavat
Just curious. Could you explain this part? _I used mongodb because it was
incredibly easy to create a local cache of the JSON data I was getting from
the apis I was accessing._

If you are using the data from the various APIs to render static pages, why do
you need the local cache of the JSON data? And if you've got a local cache of
all the JSON data, why render static pages rather than serve dynamic pages
that reference the DB of cached data?

~~~
willholloway
Sure, I would be happy to explain.

> If you are using the data from the various APIs to render static pages, why
> do you need the local cache of the JSON data?

Because the data that I am displaying for each movie is not available from any
single source. streamJoy is the abstraction layer tying together the disparate
data sources.

Mongo was the right choice because instead of having to map out postgres
schema, I could just store the entire JSON response as a dict.

What I have learned from all of this is that APIs are not always reliable.
The information changes, and you don't always know what you are going to get
back.

Mongo just made this early process where I didn't know what I was going to get
much more fault tolerant. And, no schema mapping.

> And if you've got a local cache of all the JSON data, why render static
> pages rather than serve dynamic pages that reference the DB of cached data?

Rendering static pages allows you to use S3 or nginx to serve your html. For a
one man operation like I am running, S3 is manna from heaven for scaling the
serving of html files.

I'm using nginx right now only because I haven't taken the time to use S3, but
S3 is the smarter choice here.

My goal with this was to have the smallest possible dynamic server footprint.

The other reason for caching json data is I run analysis on the db items, even
though I haven't published any of those features yet.

By having my own copy of the data, I can run a process on 92000 items in 6
minutes instead of taking a day due to API rate limits.
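
A minimal sketch of that kind of schemaless local cache, using plain JSON files
from the standard library as a stand-in for MongoDB. The class, source names,
and fields are hypothetical and not streamJoy's actual code.

```python
import json
import tempfile
from pathlib import Path


class JsonCache:
    """Store each raw API response as one JSON file, with no schema imposed."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, source, item_id):
        return self.root / f"{source}-{item_id}.json"

    def put(self, source, item_id, response):
        self._path(source, item_id).write_text(json.dumps(response))

    def get(self, source, item_id):
        path = self._path(source, item_id)
        return json.loads(path.read_text()) if path.exists() else None


def merged_record(cache, item_id, sources):
    """Combine whatever each source returned; missing or non-dict responses are skipped."""
    record = {}
    for source in sources:
        response = cache.get(source, item_id)
        if isinstance(response, dict):
            record[source] = response
    return record


if __name__ == "__main__":
    cache = JsonCache(tempfile.mkdtemp())
    cache.put("netflix", 42, {"title": "Example", "streaming": True})
    cache.put("amazon", 42, {"title": "Example", "rental_price": 3.99})
    print(merged_record(cache, 42, ["netflix", "itunes", "amazon"]))
```

Because the cache is local, a batch job can loop over every item without
touching the remote APIs or their rate limits.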

~~~
derefr
> For a one man operation like I am running, S3 is manna from heaven for
> scaling the serving of html files.

Have you tried putting CloudFlare in front of it? Buttered jelly-roll manna.

------
SmileyKeith
Am I the only one who thinks NPR is awesome for having a Github profile with
seemingly good and useful code on it?

~~~
jeremyjbowers
We're trying to do as much in the open as we can! The way we figure it: If
public media can't share code, who can?

~~~
SmileyKeith
That is so awesome. Thank you.

------
nicpottier
A very pragmatic approach for serving tons of static content. S3 transfer
costs are still pretty competitive, doubly so if you compare them to not just
your own iron but the expertise on staff to maintain and scale it all.
(unlikely you would ever match the S3 uptime even if you did)

~~~
iharris
Not to mention that S3 buckets tend to remain online when EC2 instances in
US-east are exploding due to <insert a problem related to EBS, network, or
datacenter power failure>. :)

~~~
andrewmunsell
The last major outage for S3 was something like 4-5 years ago, IIRC. Coupled
with CloudFront or another CDN, your site probably would be extremely
resilient to traffic spikes or hardware/datacenter issues.

------
andrewmunsell
It's interesting to see a site like NPR handle a site deploy like this. I've
seen blog owners start to consider switching to a static website, but news
sites are definitely a bit more difficult to maintain like this.

Personally, I use Jekyll on my own blog in a similar manner
(<http://andrewmunsell.com/>).

< ShamelessPlug >

I also wrote a tutorial
(<http://www.andrewmunsell.com/tutorials/jekyll-by-example/>) about using
Jekyll, in case you want to try something similar to what NPR did, but with a
different platform.

< /ShamelessPlug >

~~~
whimsicality
It's worth noting that our main news site, npr.org, is not all served from
flat files. This is specific to our news applications team (apps.npr.org) and
our client-side projects.

~~~
andrewmunsell
Which does make sense, since it's a constantly updated site-- regenerating all
of the files (to update "what's new" lists and such) each time an article is
written would be major overkill.

While I'm sure you guys already do this, proper caching can have a similar
effect to a completely static site in terms of performance.

------
marknutter
Someone help me understand; the flat assets are hosted on S3 but how do http
requests get resolved to the correct html file? Is it done with DNS settings?

~~~
Rabidgremlin
Yes, with a CNAME and some metadata on your S3 buckets. See here:
<http://aws.typepad.com/aws/2011/02/host-your-static-website-on-amazon-s3.html>

~~~
aninteger
Can you help me understand why someone would host static content on S3 versus
any other host? Is it because S3 automatically scales up when traffic
increases? I know almost nothing about S3.

~~~
pjscott
The pricing is competitive, it's easy and reliable, and it can handle as much
traffic as you like. S3 may not be the best option, but it's definitely a
good, no-worries option.

~~~
TillE
"Competitive" is an interesting word to use. For example, I can get 100Mbit
bandwidth from Hetzner for $9/TB. Transferring the same amount of data over S3
would cost me 10x more.

Now, if you're serving large files and you really need more than 100Mbit
sustained, S3 makes sense. But it's unquestionably a premium service for a
premium price.

~~~
shorthack
I created an account just to comment on that. I work with
different NGOs, helping them with their IT projects. Recently one of them asked
me to create an equivalent of iTunes store (with the multimedia files
available for free to the members). One of their guys was hyper enthusiastic
about S3. But no matter how we calculated it, Hetzner was (much) cheaper. Now,
the service has been running for over a year on Hetzner services and everybody
is very happy. There was one incident (the motherboard on the server died -
they replaced it very quickly), that's all. We have all the options we need.
Note that the current setup is fairly small: we're serving ca. 4 TB of data to
ca. 4000 members.

Now, when people say that S3 scales well, I totally agree. But why they say
that the prices are competitive is beyond me. Take Xirra's XS-12 storage
(200€/month for 36 TB) and compare it to storing 18 TB (I assume RAID 1) on
S3. 1 TB costs $95 on S3! How on Earth can you call it 'competitive'? Now,
that's even _without_ bandwidth costs. I totally agree it's a premium service
for a premium price. There are plenty of cases when using S3 is just a big
waste of money. (And Amazon's decision not to implement cost capping isn't
helping either.)
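
Spelling out the storage arithmetic with the figures quoted above. This is a
back-of-the-envelope sketch that assumes the $95 per TB figure is a monthly
price; it leaves the two currencies unconverted and ignores bandwidth.

```python
S3_STORAGE_USD_PER_TB = 95     # S3 figure quoted above (assumed to be per month)
DEDICATED_EUR_PER_MONTH = 200  # Xirra XS-12 figure quoted above
USABLE_TB = 18                 # 36 TB raw, halved for RAID 1


def s3_monthly_storage_usd(terabytes, usd_per_tb=S3_STORAGE_USD_PER_TB):
    """Monthly S3 storage cost in USD for the given number of terabytes."""
    return terabytes * usd_per_tb


def dedicated_eur_per_tb(monthly_eur=DEDICATED_EUR_PER_MONTH, terabytes=USABLE_TB):
    """Effective monthly price per usable terabyte on the dedicated box, in EUR."""
    return monthly_eur / terabytes


if __name__ == "__main__":
    print(s3_monthly_storage_usd(USABLE_TB))   # 1710
    print(round(dedicated_eur_per_tb(), 2))    # 11.11
```

So storing the same 18 TB comes to $1,710 a month on S3 against 200€ a month
for the dedicated box, before any transfer charges.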

------
chadmaughan
This is beautiful. I love NPR.

I'm curious, what are the rules/requirements for initiating a new "NPR app"?
An election app seems totally obvious, but what about other apps? Is it based
on available data? Available funds? Pervasiveness of a certain story? An
individual reporter's weight? (for example, if I was on the team and Nina
Totenberg made an app request, I'd drop everything and do it for her - she's
dreamy)

Also, how much lead time do you typically get with your apps? A few days, a
few weeks, longer?

------
dryan77
We did a lot of this at the Obama campaign as well. Can't recommend it enough.

------
LAMike
Can someone explain the concept of "flat files" to me and why people like to
use it?

~~~
willholloway
Instead of dynamically generating HTML and JS on each page view, you run a
render process that creates html files and serve them statically until the
next refresh cycle.

It's the difference between running a wordpress server and a jekyll blog.

The security and scaling benefits are immense. With javascript you can
replicate a lot of the functionality of dynamic sites.
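
A minimal sketch of such a render step in Python, assuming a list of items
with hypothetical `slug`, `title`, and `body` fields. It uses
`string.Template` from the standard library instead of a real template engine
like Jinja, just to stay dependency-free.

```python
import tempfile
from html import escape
from pathlib import Path
from string import Template

PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def render_site(items, out_dir):
    """Write one static HTML file per item and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        path = out / f"{item['slug']}.html"
        html = PAGE.substitute(title=escape(item["title"]), body=escape(item["body"]))
        path.write_text(html)
        written.append(path)
    return written


if __name__ == "__main__":
    pages = render_site(
        [{"slug": "hello", "title": "Hello", "body": "Rendered once, served many times."}],
        tempfile.mkdtemp(),
    )
    print(pages[0].read_text())
```

Run it from cron (or after each content change) and point any static file
server at the output directory.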

~~~
LAMike
Oh I think I get it, instead of waiting for a user to click on a page and then
call the server to render that page, the flat file method renders all the
pages at once w/o worrying about which pages the user will click on...
Correct?

~~~
njharman
I say this in a "how far have we come", "we've come full circle" way, and not
in any way as disparaging.

As someone who remembers Perl CGI's ability to serve dynamic content as being
revolutionary and awesome, it amazes me to hear from someone whose only
experience is dynamic content and who needs http _file_ serving explained to
them.

~~~
LAMike
Haha I began to code last year and got started with backbone so this flat file
paradigm is a little new to me :0

------
mckoss
Is there still an issue with GZIP encoding S3-delivered static assets? From my
reading, it looks like Amazon will not automatically convert assets to GZIP
encoding when the browser indicates support. Rather, you have to upload GZIPed
files and manually configure the content encoding headers. This approach would
be broken for browsers that don't support GZIPed files.
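
For reference, the manual approach looks roughly like this in Python. It is
only a sketch: it compresses the body and builds the headers you would attach
at upload time. The upload call itself is left out, and `prepare_gzipped` is a
made-up helper name.

```python
import gzip


def prepare_gzipped(body: bytes, content_type: str):
    """Return (compressed_bytes, headers) for a pre-compressed static asset."""
    compressed = gzip.compress(body)
    headers = {
        "Content-Type": content_type,
        # Tells the browser the stored bytes are gzip-compressed.
        "Content-Encoding": "gzip",
    }
    return compressed, headers


if __name__ == "__main__":
    page = b"<html><body>" + b"hello " * 200 + b"</body></html>"
    data, headers = prepare_gzipped(page, "text/html")
    print(len(page), len(data), headers)
```

Since the stored object is always the compressed one, a client that cannot
decode gzip gets bytes it cannot read, which is the concern raised above.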

~~~
driverdan
Which browsers don't support gzip?

~~~
wahnfrieden
<http://www.stevesouders.com/blog/2009/11/11/whos-not-getting-gzip/>
(just googled your post, this was pretty informative)

~~~
driverdan
Thanks, but that's from 2009 and isn't entirely relevant. Those stats are for
incoming requests that don't specify that they accept gzip, not that they
can't use it. Often this is caused by proxies stripping headers from the
request.

I remember reading an article, I believe by Google, about using browser
sniffing instead of encoding headers to determine if gzip should be used. I
can't find it now but the conclusion was that it's almost entirely safe to use
gzip 100% of the time.

------
clint
We did this a ton when I worked at Ars Technica. Flat files should definitely
be one of the things in your toolbox you consider first before moving on to
more complicated schemes!

------
mati
And what about handling HTTP 500 errors? Although rare, they can still happen.
And you cannot make the user's browser retry in such case (as you could do if
you were making the request in your app). Amazon's Best Practices document
<http://aws.amazon.com/articles/1904> clearly states that that's what you
should do:

 _500-series errors indicate that a request didn't succeed, but may be
retried. Though infrequent, these errors are to be expected as part of normal
interaction with the service and should be explicitly handled with an
exponential backoff algorithm (ideally one that utilizes jitter)._

However, I still think hosting on S3 is a great option. They are pretty
reliable anyway.
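
For code that does make the request itself, the advice in that quote can be
sketched like this in Python. The `ServerError` class and the `request`
callable are hypothetical stand-ins for a real HTTP client and its 500-series
responses; the delays use "full jitter" (a random wait up to a capped
exponential bound), which is one common variant and not necessarily the one
Amazon has in mind.

```python
import random
import time


class ServerError(Exception):
    """Stand-in for a 500-series response from the service."""


def retry_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call request(); on ServerError, wait with exponential backoff plus jitter and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Full jitter: random wait between 0 and the capped exponential bound.
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise ServerError("500")
        return "ok"

    print(retry_with_backoff(flaky), "after", state["calls"], "calls")
```

None of this helps when the failing request comes straight from a visitor's
browser, which is the gap pointed out above.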

------
Tactic
I ran a site (stomped.com) in the late '90s and we did this. We were serving
up news items to millions of unique visitors a month and hitting the DB on
each page hit. Rather than try to implement a bunch of caching methods we went
with generated HTML files. Given how often content changed (a few times an
hour at most) it seemed a waste to generate 10s of thousands of db/cache hits
when things rarely changed. Simple. Stable.

------
tantalor
> Compile app_config.py into app_config.js so our application configuration is
> also available in JavaScript

Terrifying! Why do you need to specify your configuration in code? I would
think configuration as data is simpler.

~~~
0x0
Why not? With config as data, you need an extra parsing step. More moving
parts, more maintenance, and the same end result.

~~~
andrewmunsell
To be fair, that parsing step is still here-- it's just being parsed into
Javascript.

~~~
pyre
If the configuration was data, then you would have two parsing steps:

    
    
      Data => Python
      Data => JavaScript
    

With this setup they only have one:

    
    
      Python => JavaScript
    

Granted, storing it in YAML/JSON/whatever means that you could potentially
have many different codebases / languages reading it without a Language A =>
Language B conversion. It just depends on what works best for your project /
team.
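
A rough sketch of what a Python => JavaScript step could look like, assuming
the config holds only JSON-compatible values. The settings and the
`APP_CONFIG` variable name are made up here and are not taken from NPR's
template.

```python
import json

# Hypothetical settings, standing in for whatever app_config.py actually holds.
CONFIG = {"PROJECT_NAME": "my-app", "S3_BUCKET": "example-bucket", "DEBUG": False}


def to_javascript(config, var_name="APP_CONFIG"):
    """Serialize a dict of JSON-compatible settings as a JavaScript assignment."""
    return f"var {var_name} = {json.dumps(config, indent=2, sort_keys=True)};\n"


if __name__ == "__main__":
    print(to_javascript(CONFIG))
```

The Python file stays the single source of truth, and the generated file is
just another build artifact.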

------
nickmerwin
NPR.org itself has a great JSONP API that makes apps like the one in this
article possible. Here's a site I put together that uses it and HTML5 audio
for a very quick and minimal NPR listening experience (best viewed on iOS
safari):

<http://npr.io>

It's a Jekyll app mostly written in CoffeeScript, deployed to S3 with
CloudFront CDN'ing.

Here's a lengthier introduction for anyone interested:
<http://nickmerwin.com/2012/02/25/npr-io/>

------
stcredzero
_> There are three salient Boyerisms I’ve picked up in my month as an NP-
Rapper that sum up these differences...On our team, these Boyerisms aren’t
just preached — they’re practiced and implemented in code._

Summary: Go ahead and toss off an unusual term or slang, but don't expect it
to actually _inform_ readers if they can't Google it. If you aren't trying to
inform readers, exactly what are you trying to do?

I found the unexplained use of the word "Boyerism" in this article to be
confusing and unprofessional. I have nothing against the use of slang, or the
expectation that readers will need to Google terms: this is the reality of our
culture in the age of the Internet and excellent search. However, the casual
and unexplained use of a private group's inside joke reveals a _lack of
awareness_ of how search interacts with culture. Is it a deliberate attempt to
confuse and snub the larger audience, or is it unintentional?

~~~
jeremyjbowers
I can attest that it's unintentional. We roll pretty informally. And, frankly,
we're pleased that anyone outside of the small (and insular) news apps
community is reading about our setup.

~~~
pifflesnort
> _We roll pretty informally._

Between the swearing, and the grossly informal dialog when explaining what
should be rational and defensible technical choices:

1) I can tell.

2) It doesn't reflect [well on] NPR.

~~~
monatron
Who determines what the decorum for such a dialog should be? Does the
informality affect in any way the rationality or defensibility of their
choices? Why do people care? I'd much rather read an article in this tone
than some stuffy technical blog like the ones many other engineering teams
put out.

~~~
pifflesnort
> _Does the informality affect in any way the rationality or defensibility of
> their choices?_

Clarity in writing is not a stuffy affectation, but rather is the entire
mechanism by which one both expresses an opinion, and provides an
understandable, rational justification for that opinion. Without this, the
reader is left with nothing but unsupported conjecture, opinion, and emotive
appeals.

------
huhsamovar
So, how does it not go down if you have to ship to EC2?

~~~
justincormack
It is served from S3 (this was not entirely clear).

~~~
jere
Dumb question, but isn't it possible for S3 to go down (or is the uptime just
so high it doesn't matter)?

~~~
jeremyjbowers
There hasn't been an acknowledged S3 outage since April 2009 that we can find.
Also, we use two different buckets in different geographic availability zones
to allow us to stay up if one AZ goes out. Finally, our biggest projects are
cached in CloudFront, which would serve a stale cached item if the backend
were unavailable.

~~~
deskamess
Do you 'invalidate' all files when pushing an update or do you use a low cache
expiry value for CloudFront?

In particular, I am thinking of edge cases where a news story has a typo/other
important correction and you want to update just that story. What is your
strategy? How is it impacted by caching done by CloudFront? Thanks.

------
codebeard
The hell? I posted this two days ago and it got no attention!

That particular NPR blog always has greatly insightful posts.

------
zv
"Servers are for chumps". Well, using EC2 still counts as using servers.

~~~
acdha
Note that EC2 isn't in their critical path: it's a prep stage rather than
serving directly to the masses.

------
EGreg
While I agree with many of the sentiments expressed in the article, I think
"never goes down" and "costs you practically nothing" only seem true while
nothing bad happens.

When a hard drive crashes or a truck runs into your data center (here's
looking at you, Rackspace) or you need failover for any reason, that's when
you wish you had virtualized in more than one machine.

Want something that's always available and never crashes? Look at freenet.
Distributed computing model. If we can failover the DNS, you can have the same
thing on the web.

~~~
pjscott
S3 does replication and failover. That's part of its advantage over using a
dedicated server running nginx or something.

~~~
EGreg
That is what I was talking about

