
Lessons from a Google App Engine SRE on how to serve over 100B requests per day - rey12rey
https://cloudplatform.googleblog.com/2016/04/lessons-from-a-Google-App-Engine-SRE-on-how-to-serve-over-100-billion-requests-per-day.html
======
dekhn
I used to be a scientist and I went to work at Google to apply their
technology to science problems. My first team was SRE, and I have to say,
Google's SRE approach to computing completely changed how I thought about
things, and more importantly, how I programmed systems that went to
production. I've read the SRE book and can highly recommend learning from the
principles it lays out.

~~~
kiloreux
Can you please give a reference to that book?

~~~
dekhn
It's the one mentioned in the interview,
[http://www.amazon.com/Site-Reliability-Engineering-Productio...](http://www.amazon.com/Site-Reliability-Engineering-Production-Systems-ebook/dp/B01DCPXKZ6)

------
mikecb
The book they mention[1] is very good so far.

[1][https://play.google.com/store/books/details/Betsy_Beyer_Site...](https://play.google.com/store/books/details/Betsy_Beyer_Site_Reliability_Engineering?id=tYrPCwAAQBAJ)

~~~
thedevil
Upvoting because this link is cheaper than the Amazon link.

~~~
pkaye
If you are willing to wait, O'Reilly puts its ebooks on sale for 50%-60% off
several times a year. The best time is Black Friday.

------
i336_
> Advance preparation, combined with extensive testing and contingency plans,
> meant that we were ready when things went [slightly wrong] and were able to
> minimize the impact on customers.

Following that link provided some interesting reading (for a mundane error
report, at least):
[https://groups.google.com/forum/#!msg/google-appengine-downt...](https://groups.google.com/forum/#!msg/google-appengine-downtime-notify/T_e7lVg7QNY/lifcqZBcracJ)

TIL that even Google have datacenter fluctuations they can't figure out. It's
nice that they quietly make this info publicly available, and also nice that
I've now discovered where to find it :)

~~~
desarun
I love their solution.

They turned it off, then on again

------
bobp127001
I've been seeing a lot of references to SRE recently. Is Google trying to
market this position and acquire more engineers?

The SRE book, and Google in general, have mentioned that SREs are notoriously
hard to hire, and I'm wondering if they are doing a marketing push.

~~~
thrownaway2424
SREs are very hard to hire, speaking from experience. At Google SRE directors
and VPs will often cherry-pick promising candidates from the mainline SWE
hiring pipeline and give them a "hero call" to convert them to SREs. SREs at
Google are also paid more, controlling for level and performance, as a way to
hire and retain.

~~~
Reedx
Interesting. Can you expand on "hero call"? What does that entail?

~~~
agentultra
Donning a cape and meeting destiny.

In all seriousness they make it out to be more than it is. From my experience
going through their hiring pipeline, there seem to be two tracks in SRE:
software and sysadmin. If you score higher in algorithms and data-structures,
presumably, you'll end up working more on tools and libraries whereas in the
other you'll work more on infrastructure and automation. Either way both
tracks work together on the same team towards the same goals.

If you want in, be prepared to solve simple-to-tough algorithm problems and be
quizzed on TCP re-transmission, Linux system calls, and memory pressure. It's
a bit challenging because you not only have to know Big-O well enough to
estimate the asymptotic complexity of an arbitrary algorithm but you might
also be asked what a sequence of TCP packets would look like if you sent some
data and pulled the plug or what the parameters are to a given system call on
Linux. You quite literally have to know everything from how virtual memory
works, how to implement a fast k-means, how the network stack works from top
to bottom, etc, etc.

If you've done any work in cloud development and supported a moderately large
deployment, it's that but bigger. Make one a hero, it does not.

------
nunez
I'm actually really really glad that Google released this book because I think
they are one of the few companies that are actually doing this SRE thing right.
I think the hardest bit about the SRE paradigm (like DevOps) is having
companies wholly adopt it, and I think that this book being out will help
change that.

------
tdmule
This got me wondering what the AWS services' workload per day was. Best
numbers I could find were from this 2013 article about serving ≈95 billion
requests per day for just S3. The size and scope of cloud providers is truly
cool and fascinating engineering.

[https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-obje...](https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-objects-11-million-requests-second/)
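
For anyone checking the ≈95 billion figure, here is a rough sketch of the arithmetic. It assumes the rate behind the linked post is 1.1 million requests per second (my reading of the URL slug, not a number stated in this thread) and that the rate holds for a full day, which probably overstates things if it is a peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Assumption: 1.1 million requests/second, inferred from the linked post's URL.
s3_requests_per_second = 1_100_000

# Assumption: the rate is sustained for all 24 hours.
s3_requests_per_day = s3_requests_per_second * SECONDS_PER_DAY

print(f"{s3_requests_per_day / 1e9:.1f} billion requests per day")
```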

------
mtgx
I don't know why this isn't on HN, but this is another interesting post from
the Google Cloud Platform blog from today:

[https://cloudplatform.googleblog.com/2016/04/Google-and-Rack...](https://cloudplatform.googleblog.com/2016/04/Google-and-Rackspace-co-develop-open-server-architecture-based-on-new-IBM-POWER9-hardware.html)

~~~
rey12rey
It is ->
[https://news.ycombinator.com/item?id=11440179](https://news.ycombinator.com/item?id=11440179)

------
ec109685
s/lessons/lesson/

"If you put a human on a process that’s boring and repetitive, you’ll notice
errors creeping up. Computers’ response times to failures are also much faster
than ours. In the time it takes us to notice the error the computer has
already moved the traffic to another data center, keeping the service up and
running. It’s better to have people do things people are good at and computers
do things computers are good at."

------
yelnatz
1,157,407 requests per second.
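
That number is just the headline's 100 billion divided evenly over a day. A quick sketch of the division, assuming perfectly uniform traffic (real load will have peaks well above this average):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

requests_per_day = 100_000_000_000  # the "100 billion" from the headline

# Integer division gives the average rate, rounded down.
requests_per_second = requests_per_day // SECONDS_PER_DAY

print(f"{requests_per_second:,} requests per second")
```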

~~~
iLoch
One Node.js server can do 10x that! /s

------
thirdreplicator
Only 100 bytes? That's easy... Sheesh.

