
How we successfully handled 2.5x traffic in a week - talonx
http://engineering.khanacademy.org/posts/handling-2x-traffic-in-a-week.htm
======
polote
If Khan Academy uses YouTube to serve their video and uses Fastly to serve
static content, what makes it hard to scale?

I mean, being able to scale that easily is a great thing, but is there anything
worth sharing with the world in their case?

~~~
SilasX
"How to scale: Make it somebody else's problem."

~~~
spyspy
A surprisingly legitimate solution.

~~~
SilasX
Yes, with the caveat that you may have to check that they're actually capable
of handling the load and you don't get a surprise notice that "uh we can't do
this, you're on your own".

~~~
RcouF1uZ4gsC
I am pretty confident that YouTube will be able to scale to handle any
increased load coming from my service.

~~~
judge2020
But they might not at any time, at least without charging. With an apparent
internal push for some services to become self-sustaining (see Google Maps
API, Recaptcha), YouTube embeds might be next.

~~~
dangoor
Khan Academy today supports serving video outside of YouTube, which is blocked
in some schools. We could essentially flip a switch to not use YouTube, but
the cost would be substantial because those videos go Fastly->S3, so anything
not in cache is going to result in S3 egress charges.
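
To make the cost concern concrete, here is a rough sketch of how origin egress
scales with the CDN miss rate. The function name, the traffic volume, the hit
ratio, and the per-GB price are all made-up placeholders, not Khan Academy's or
AWS's actual figures.

```python
def origin_egress_cost(total_gb_served, cache_hit_ratio, price_per_gb):
    """Estimate origin egress cost for traffic served through a CDN.

    Only cache misses reach the origin, so only that share is billed
    as origin egress. All inputs here are hypothetical placeholders.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_gb = total_gb_served * (1.0 - cache_hit_ratio)
    return miss_gb * price_per_gb


# Hypothetical numbers: 1000 GB served, 75% hit ratio, 0.10 per GB.
cost = origin_egress_cost(1000, 0.75, 0.10)
```

The point of the sketch is only that the bill tracks the miss share: halving
the miss rate halves the origin egress cost.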

~~~
toomuchtodo
Something to consider in your tooling for this is to target Backblaze's B2
object store, which supports an S3 compatibility layer [1] (so you shouldn't
need to change too much code). I'm unsure if Fastly supports B2 in this
configuration yet though.

[1]
[https://news.ycombinator.com/item?id=23069114](https://news.ycombinator.com/item?id=23069114)

~~~
dangoor
Yeah, I'm sure that if we needed to serve more videos through our "fallback
player", we'd investigate more cost-effective ways to run it. With thousands
of videos, it'd be a pain to move … but at least that would be a one-time
pain!

~~~
mohsenhaddad
Something else that can also help with cost and quality is EasyBroadcast’s
viewer-assisted streaming in addition to Fastly. Adding a JS snippet to the
player pages enables CDN offloading by making each viewer act as a potential
source based on QoE/QoS metrics. Disclaimer: I am one of the cofounders. Happy
to help.

------
xhkkffbf
This is a good example of how cloud tools make this kind of scaling easy.

The trickier part can be the cost-- which this piece notes will increase
roughly linearly with the number of users. If Khan Academy is free, I think
this means those who are generous are going to need to keep giving to keep it
that way. Let's step up, everyone.

------
Ididntdothis
Wouldn’t you generally deploy your services with a certain safety
margin? I am pretty sure most systems I am working on could handle 10x pretty
easily. Then it would get hard but it seems 2.5x is a pretty normal and
expected fluctuation.

~~~
tyrust
Provisioning for 2.5x of peak load would be pretty expensive and, in most
cases, overly cautious (good luck selling 10x to your financiers!). For a large
service that wants to be able to handle unexpected spikes, I'd expect
something more like 1.1-1.5x margin.

As their blog post says, they weren't unprepared. They had the means to grow
and used it. Burning money before that time wasn't necessary.
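
The margins discussed here translate into instance counts with simple
arithmetic. A minimal sketch, where the function name and the example numbers
(10,000 requests per second at peak, 100 per instance) are made up for
illustration:

```python
def instances_needed(peak_rps, headroom_percent, rps_per_instance):
    """Instances required to serve peak load plus a safety margin.

    headroom_percent is the provisioning target as a percentage of peak
    (110 means 1.1x peak). Integer math avoids float rounding surprises.
    """
    if peak_rps < 0 or headroom_percent < 0 or rps_per_instance <= 0:
        raise ValueError("inputs must be non-negative, with a positive per-instance rate")
    target = peak_rps * headroom_percent
    per_instance = 100 * rps_per_instance
    # Ceiling division: round up so capacity is never below the target.
    return -(-target // per_instance)


# Hypothetical service: 10,000 requests/s at peak, 100 requests/s per instance.
counts = {pct: instances_needed(10_000, pct, 100) for pct in (110, 150, 250, 1000)}
```

With these placeholder numbers, a 1.1x margin means 110 instances and a 10x
margin means 1,000, which is the cost gap at issue.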

~~~
Ididntdothis
Makes sense. I guess I was thinking about smaller deployments where the
absolute server cost isn't that high.

------
ashtonkem
2.5x is a surprisingly small jump for having all of your brick and mortar
competitors shut down for an indeterminate period of time. Either Khan Academy
had amazing penetration into the education space, or the follow-through rates
for kids being educated at home are abysmally low.

Disclaimer: I don’t have children, so I have no real world experience with
Khan.

~~~
dangoor
There's a lot going on in your observation, and this is all speculation on my
part (even though I am a Khan Academy employee).

We did have quite a bit of usage and awareness among schools already before
the shutdowns started. Couple that with there being many options for teaching
online… I wouldn't be surprised if a lot of schools just switched to having
their teachers attempt to do their normal teaching via Zoom (which sounds
really hard to me!). Many schools had contracts of various sorts with other
online learning platforms.

Some schools or classes haven't had great follow through rates, which is
unfortunate, but educators all over have had to quickly adjust. I suspect that
more robust plans will be in place by the fall, given how much uncertainty
there is for fall classes. Khan Academy is, at least, an always-free resource
that's there for people if they need it.

That 2.5x is starting from a large base, and there's also a lot of activity in
online education generally.

~~~
20years
"I wouldn't be surprised if a lot of schools just switched to having their
teachers attempt to do their normal teaching via Zoom"

This is exactly what is happening in our school district and it is a big
failure. That, and teachers emailing their lesson plans for parents to print,
or parents going to the school to pick up printed packets. We then have the joy
of taking photos of the completed work and emailing those back to the
teachers.

It is extremely inefficient and I have already informed our school that we
will not be doing that if we are stuck in this scenario come fall. We will be
using Khan for math and other online learning platforms for LA.

~~~
fludlight
What are the other platforms you mention?

------
hinkley
Khan Academy is actively soliciting donations right now, as is referenced in
the footnote to the article:

> Khan Academy's increased usage has also increased our hosting costs, and
> we're a not-for-profit that relies on philanthropic donations from folks
> like you.

------
cagenut
I love it when people have both the inclination and the political pull to keep
an environment super minimalist like this. Fastly to AppEngine is a blazing
fast combo and so well sorted to "just work".

------
nunorbatista
Khan Academy is great, don't get me wrong, but what I see is an engineering
blog that doesn't force HTTPS and an ad for Google products. All that
Khan Academy did was optimize code, set up the partner console properly and pay
the (likely enormous) bill. As I read in another comment here: "How to scale:
Make it somebody else's problem" + pay the bill.

Edit: ah, a case study from Google about Khan Academy. This post was definitely
an ad:
[https://cloud.google.com/customers/khan-academy](https://cloud.google.com/customers/khan-academy)
Another:
[https://cloudplatform.googleblog.com/2013/08/khan-academy-ru...](https://cloudplatform.googleblog.com/2013/08/khan-academy-runs-on-google-cloud-platform.html)

~~~
talonx
It does enforce HTTPS - what are you talking about?

------
parhamn
Notably:

- No Rust/Go rewrite

- GC not disabled

- Didn't apply the latest research on k/v storage

Jokes aside, this is the fun part of hosted software, and I'm glad to hear the
"things don't have to be so hard" side of things. Hope it continues working
out!

~~~
spyspy
If you're already using GCP, my general advice for new projects is almost
always some form of "just throw it on AppEngine". No, you don't need multi-
region deployments. No, you don't need 32TB of memory per instance. No, you do
not need kubernetes. No, istio is not going to solve this. No, you're not
hosting your own kafka cluster.

I've found devs are always trying to over-engineer complex solutions to dead
simple problems. Just let Google do it and get some sleep.

~~~
realbarack
For new projects sure, but you need an escape hatch. App Engine costs can
spiral out of control. I know of at least one startup that was pretty
successful in finding product-market fit but sank their own ship because they
weren't able to migrate off of App Engine quickly enough.

~~~
cglace
If you run on app engine flexible it shouldn’t be hard to migrate.

~~~
spyspy
Not even flex. AppEngine standard added the ability to deploy containers in
2018.

------
lultimouomo
They are now suffering a major connectivity outage:

[https://status.khanacademy.org/](https://status.khanacademy.org/)

Unfortunate timing for the blog post to reach HN...

------
httpsterio
I haven't actively thought about Khan Academy for several years and only just
remembered its existence. I do think that it's all sorts of brilliant and
that's why I just signed up as a volunteer translator. I hope some of you
other people here will do the same.

------
programminggeek
On a largely content based app/site, most of "scaling" comes down to caching.
However you do that is up to you, but somewhere between caching at the browser
layer, proxy layer, web server layer, or memcache layer, things should be fast
and scalable without getting too fancy.
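
As a small illustration of layering the browser and proxy caches, the standard
`Cache-Control` header lets the two use different lifetimes: `max-age` applies
to the browser, while `s-maxage` overrides it for shared caches such as a CDN
or reverse proxy. The helper name and the TTL values below are made up for the
example.

```python
def cache_control(browser_seconds, shared_seconds):
    """Build a Cache-Control value with separate browser and shared-cache TTLs.

    max-age is honored by the browser; s-maxage takes precedence over it
    in shared caches (CDN, reverse proxy).
    """
    if browser_seconds < 0 or shared_seconds < 0:
        raise ValueError("TTLs must be non-negative")
    return f"public, max-age={browser_seconds}, s-maxage={shared_seconds}"


# Hypothetical policy: browsers revalidate after a minute, the CDN after an hour.
header = cache_control(60, 3600)
```

A short browser TTL with a longer shared TTL keeps most requests off the
origin while still letting clients pick up changes reasonably quickly.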

------
jve
Some fun facts from scaling and optimizing the dominant school management site
in our country, which is used by schools, kindergartens, parents and kids.
For years now, schools have not had to keep physical journals around because
they use this system.

The peak was usually when the end of the semester was coming. Currently we must
sustain 2x that peak daily, and we hit 3x that peak on the first day "remote
schools opened". Everyday traffic is currently almost 3x what it would
have been, measured by requests per second received on the frontends.

We were usually struggling at the end of the semester. Luckily we started to do
some upgrades and optimizations to handle that just before the lockdown started.
In the end, we can now sustain many times the traffic we have. The hardest
part wasn't adding more resources (that was a MUST for sure); it took much
more effort to handle stuff that scales vertically (SQL) and some file share
issues.

So there were some intermittent issues that were frustrating the users,
sometimes bringing the whole site down for minutes. ("Sometimes" is the
trickiest part to handle.) That includes hunting down expensive queries (not so
hard), calling fewer or more optimal queries, taming the SQL plan cache and
dealing with some NTFS stuff for the file share.

Some of the issues couldn't be solved solely by throwing more hardware at them.

I'm still puzzled by the file share issue on Windows. Yeah, it's not so clever
to store millions of files in a single folder on an NTFS filesystem. We have an
append-only share, with no deletes ever happening. Stuff like timestamps and
short paths were disabled, $mft was defragmented and the like, but... every ~24
hours the drive would become sooo painfully slow and inaccessible. Some access
denied errors get thrown, etc. And it continues for some minutes. Maybe 15.
Sometimes more. But between those ~15 min periods it works, so it's like a wave
with some period. The thing is, outside of this window the file share works
very well (with all those millions of files within a folder). They were never
deleted and we didn't need to enumerate, just fetch file by file.

No, antivirus wasn't at fault, and no, the backup system wasn't messing around.
Is there stuff NTFS may do under the hood on SSD drives? I'm aware there is a
TRIM process, but as I understand it, that only comes into play when stuff is
being deleted from the SSD?

We moved to rotating disks (!) and split those files across a few folders and
got rid of those issues. But still, 1 folder contains way more files than you
would hear anyone say is healthy on NTFS.
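
For anyone facing the same layout problem, one common way to spread files over
many folders is to derive the subfolder from a hash of the file name. This is
only a sketch of that general technique, not the scheme described above; the
function name, the two-level layout, and the `/share` root are assumptions, and
`PurePosixPath` is used just to keep the example platform-independent.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, levels=2, width=2):
    """Map a file name to a nested subfolder derived from its SHA-256 hash.

    With levels=2 and width=2 (hex characters), files are spread over
    up to 256 * 256 folders instead of piling up in a single one.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Take consecutive hex slices of the digest as folder names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


# Hypothetical share root and file name.
path = sharded_path("/share", "report-2020-05.pdf")
```

Because the folder is computed from the name alone, a reader can locate a file
without enumerating anything, which matches the fetch-by-name access pattern.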

------
sdan
Love this sort of explanation of how companies and people run their infra.
Great job KA!

~~~
jordache
It's a rather high-level article, low on insightful detail.

------
freefriedrice
TL;DR: Load balancers and a clear policy mean the cloud works as advertised.

Seven years ago I was at a medical conference in Portland, Oregon with a panel
of "experts" discussing the security and accessibility of medical record
systems and wearable devices. There was a principal engineer from Intel on the
panel. When someone asked about the cloud, this tall, lanky, long-bearded man
with a thick accent stood up and said:

"The cloud? (chuckles) What is the cloud? Where is the cloud? Is it over here?
(Points to a table) Is it over there? (points to another table). The cloud is
a joke, man. It's a complete joke."

EDIT: added an anecdote for SEO. :)

------
cheungyinglon
Is Fastly the best service for caching?

~~~
pier25
I'm very happy with Cloudflare's workers so far.

You can store stuff in Workers KV (sessions, images, complete static sites,
etc.) and even interact with their global cache through an API.

------
jordache
If they're just referencing YouTube videos, what is there to scale up?
Speedier downloads of static content and repeat visits from what is likely a
relatively stable user base?

~~~
dangoor
This question is not uncommon, so I really should write a blog post I can
refer to. I've got another comment in this thread about this:
[https://news.ycombinator.com/item?id=23171877](https://news.ycombinator.com/item?id=23171877)

------
throwawaysea
Is there a good alternative video host or platform to YouTube? I always worry
about their fickle content management/censorship practices, and also just
don't like the idea of massive centralization around Google.

~~~
a_imho
What is the use case? Does it need to be free as in beer? Cloudflare Streaming
starts at $5/month for 1000 minutes stored and $1/1000 minutes streamed.
Though I have to admit I had performance problems with their built-in analytics
API (and their support was unable/unwilling to look into it). I'm sure there
are plenty of others around.

------
jliptzin
2.5x in a week is news? I have worked on many things, from viral apps to blogs
that get picked up by large news orgs, etc., that need to scale sometimes 100x
or more in a day.

~~~
bdibs
1 to 100 is easier than 100M to 250M (made up numbers, but you get the point).

~~~
polote
It depends on what you have to scale. If this is just PHP rendering, this is
still relatively easy to do.

