
Building resilient services at Prime Video with chaos engineering - Garbage
https://aws.amazon.com/blogs/opensource/building-resilient-services-at-prime-video-with-chaos-engineering/
======
agarzenm
It is nice to see these posted. I wonder if the other engineering related
sources at AWS get as much attention.

They have (in my opinion) an extremely good library of articles in their
so-called Builders' Library.

For example, the two articles below.

[https://aws.amazon.com/builders-library/avoiding-fallback-in...](https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/)
[https://aws.amazon.com/builders-library/leader-election-in-d...](https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/)

These topics are extremely hard to solve from scratch yet they are distilled
pretty well in the above articles and they include a further reading section.

I would implore others to have a gander. I wish the same could be said for
their documentation.

------
stuaxo
Using the Prime video app on my Android TV it seems like some basics aren't
really attended to.

The aspect ratio on the thumbnail for a film I watched the other day was
stretched, which is not the level of attention to detail I'd expect from such
a rich company.

Apart from this, it just isn't a smooth experience to navigate or search in.
As usual in development nobody is paying any attention to responsiveness.

There is no reason this app shouldn't work as well as the YouTube app in these
respects.

~~~
benbristow
Amazon products are never polished or fully fleshed out. You can spend a few
minutes on their flagship site and find weird design choices or basic CSS
mistakes.

A shame considering the amount of money they have. You'd expect a bit more,
but then again Amazon are in the business of scale, quantity over quality.

At least it keeps them from having a monopoly on literally everything.

~~~
simonebrunozzi
Uptime of essentially 100%. Hard to criticize when they accomplished something
so rare in tech.

~~~
SEJeff
_laughs in us-east-1_

~~~
simonebrunozzi
The whole point is to use multiple Availability Zones across Regions, to
protect your uptime from failures in single AZs or entire Regions. Amazon.com
does that.

~~~
SEJeff
And yet, even companies like PagerDuty, who have “one job” and hire many
competent engineers, still struggle with this:

[https://www.pagerduty.com/blog/outage-post-mortem-june-14/](https://www.pagerduty.com/blog/outage-post-mortem-june-14/)

Note that PagerDuty now weathers Amazon outages beautifully, but the point is
that Amazon doesn’t have 100% uptime and that properly designing distributed
systems is very hard.

------
dschuetz
I'm sure now that Amazon Prime Video won't ever get a decent UI; instead the
current one will remain horrible, but at least resilient.

~~~
PeterStuer
It truly baffles me how one could devise an interface as bad as the Prime Video
UI. I have yet to discover the algorithm behind the near-random results that
search throws up, and in 90% of cases autocomplete throws up something 'that
is not available in your region'. It's 2020. How hard can it be to serve a
small catalogue of available titles and allow some decent parameterized
querying?

~~~
fxtentacle
I also always wonder how companies can simultaneously offer to sell their
highly advanced AI-bla search, yet utterly fail at searching themselves.

I still remember the time when Google tried to license their search servers to
enterprises. It appears that market has now been completely eaten by Algolia.
And my hunch is that it's because Google's search results are completely
irrelevant for professional users. I search for a Windows API function name,
and I get pages of SEO spam trying to sell me an unrelated Udemy beginner's
course.

------
fxtentacle
Sounds like cloud is getting ever closer to dedicated servers.

One would hope that for the 10x price increase over bare metal, someone would
abstract away things like the underlying hardware or OS failing. But
apparently, no, you have to pay the premium and do all the work.

The only use-cases that I could imagine for such a tool on EC2 would be if you
either don't use containers, or if you oversubscribe your virtual servers by
having higher container limits than what the instance can endure.

In the first case, the proper fix is to use containers. Docker can do CPU
limiting for you so that one service spiking won't affect its neighbors on the
same instance.

In the second case, I'd go bare metal and then hardware is so cheap that
there's very little temptation to oversubscribe on RAM or CPU.

~~~
neo01124
Hi!

Author of the article here.

The core concern is not about the capabilities of the compute abstraction
being used (bare metal, containers or functions) or testing OS capabilities.
The aim is to validate mitigations which are in place to counter turbulent
scenarios (For example: massive spike in traffic, network outage, dependency
is down, etc). These scenarios generally originate outside the given system.

These kinds of questions should be asked and systematically validated (quoting
the article):

* Have you tested how the system behaves when the underlying instances have a sustained CPU spike?

* Is the system behavior understood under different stress?

* Is there sufficient monitoring?

* Have the alarms been validated?

* Are there any countermeasures implemented? For example, is auto-scaling set up, and does it behave as expected? Are timeouts and retries appropriate?

~~~
fxtentacle
I believe we just have a rather different approach here.

"Have you tested how the system behaves when the underlying instances have a
sustained CPU spike?"

Since dedicated boxes are cheap, I'd just buy 5x the CPU resources that I
reasonably need and call it a day. If there ever is a more than 5x traffic
spike, then docker will prevent it from being a noisy neighbor, so the
affected services will just become slower than usual. But even a 10x traffic
multiplier would just produce a 2x slowdown, which should be tolerable for
most users.
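
The arithmetic behind that "10x traffic, 2x slowdown" estimate can be sketched as follows. This is only a rough model: the function name is made up for illustration, and it assumes a purely CPU-bound service whose latency grows linearly once the provisioned headroom is used up. Real systems tend to degrade faster than linearly near saturation.

```python
def slowdown_factor(traffic_multiplier: float, headroom: float = 5.0) -> float:
    """Estimated slowdown when traffic grows by `traffic_multiplier`
    on hardware provisioned with `headroom` times the normal CPU need.

    Hypothetical model: CPU is the only bottleneck and latency scales
    linearly once capacity is exhausted.
    """
    if traffic_multiplier <= 0 or headroom <= 0:
        raise ValueError("multipliers must be positive")
    # Below the provisioned headroom there is spare CPU, so no slowdown.
    return max(1.0, traffic_multiplier / headroom)


for spike in (1, 5, 10, 20):
    print(f"{spike}x traffic -> {slowdown_factor(spike):.1f}x slowdown")
```

Under these assumptions anything up to a 5x spike is absorbed by the spare capacity, and only the excess beyond that shows up as added latency.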

I agree that on clouds you want to save costs by only booking what you need.
But on bare metal, you can usually afford to keep spare capacity around all
the time.

As such, I wouldn't plan for the system to behave well under stress. I'd try
to always have enough resources around so that stress never happens. At the
end of the day, this seems like a developer time vs. resource costs trade-off
and for most companies, developers are scarce and resources are plentiful, so
they'll have a very different trade-off from big FAANG companies.

"For example, is auto-scaling set up, and does it behave as expected?"

If your system is usually 90% idle, I wonder if you'll ever need that auto-
scaling. Also, I'd say my customers can endure it if page load time goes up
from 100ms to 200ms. So in my opinion, there is little need for auto-scaling
for most companies.

~~~
ses1984
>"Have you tested how the system behaves when the underlying instances have a
sustained CPU spike?"

You didn't really address this question, you addressed a different question,
which is a traffic spike.

>Also, I'd say my customers can endure it if page load time goes up from 100ms
to 200ms. So in my opinion, there is little need for auto-scaling for most
companies.

100ms to 200ms average? What about the tail? Your app might go from P99 -
500ms to P95 - timeout. That's when you'll lose customers.
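
To illustrate why the average can hide this, here is a small sketch with made-up latency samples (they are not measurements of any real service). It uses a simple nearest-rank percentile, which is one of several common definitions.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
before = [100] * 100              # every request takes 100 ms
after = [100] * 95 + [2100] * 5   # 5% of requests now stall for 2.1 s

for name, data in (("before", before), ("after", after)):
    print(name, "mean:", statistics.mean(data),
          "p95:", percentile(data, 95), "p99:", percentile(data, 99))
```

With this invented distribution the mean goes from 100 ms to 200 ms and p95 does not move at all, yet p99 jumps from 100 ms to 2.1 s, which is the part users with the slow requests actually experience.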

~~~
fxtentacle
If the underlying hardware is a bare metal server, it won't magically turn
slow and have a CPU spike. That problem is caused by noisy neighbors and is
kind of exclusive to clouds.

Well, with the 2x example, my app might get from a 1s P99 to a 2s P99 which
feels slow, but is still doable. Again, those timeouts are usually introduced
by cloud infrastructure. For example, if you use nginx outside of Heroku, it
won't have a 30s timeout for file downloads.

~~~
ses1984
Your own instances can have an unexpected CPU spike.

Even if you're running on bare metal I find it hard to believe you don't have
a layer with short timeouts between your front and backend.
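
A minimal sketch of what such a timeout layer could look like in application code. The backend function and helper names are hypothetical, and this only bounds how long the caller waits; it does not cancel the work already running in the backend.

```python
import concurrent.futures
import time


def slow_backend(delay_s: float) -> str:
    """Hypothetical backend call that takes `delay_s` seconds to answer."""
    time.sleep(delay_s)
    return "payload"


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) but stop waiting after `timeout_s` seconds.

    Returns (ok, value_or_message) so the caller can show an error
    instead of waiting indefinitely.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, f"backend did not answer within {timeout_s}s"
    finally:
        # Do not block on a stuck worker; it finishes in the background.
        pool.shutdown(wait=False)


print(call_with_timeout(slow_backend, 1.0, 0.01))   # answers in time
print(call_with_timeout(slow_backend, 0.05, 0.3))   # caller gives up early
```

The design choice is that the front end gets a definite answer, success or failure, within a bounded time and can decide what to show the user.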

~~~
fxtentacle
Why would I? I have redundant 1GBit LAN cables between front end, back end,
and database servers.

~~~
ses1984
Because it's bad UX for your users to see a spinning loading icon forever.

~~~
fxtentacle
And a timeout error would be better?

~~~
ses1984
In my experience, yes, a lot better.

------
BiteCode_dev
Is this a case of "Netflix did those articles, so we'll do the same so geeks
like us too"?

~~~
danellis
It does seem like a fad. Kudos to those who can create a new field and profit
from it. On the one hand, "chaos engineering" seems a bit like "we don't
understand our architecture well enough to know what its failure modes are, so
let's just poke it and see what happens," but on the other hand, it seems at
least a little bit analogous to fuzzing, which is certainly a technique that
yields useful results that would have otherwise been overlooked until it was
too late.

~~~
mohave529
My first instinct was to agree with this, but from my experience it's
extremely difficult to properly communicate failure modes 100% of the time
across different teams in very large organizations. Dependencies that are
fuzzy arise for example when a service A proxies data for client service B
from some other service C. It doesn't help that the organization of teams in a
company often severs lines of communication between teams who explicitly don't
have dependencies but implicitly do. As a result, information gets lost in the
process. Having a last line of defense in the form of a "chaos engineering"
team may actually be the natural response of large organizations to counter
the inherent messiness that is produced as a result of bureaucracy.

~~~
lhoff
That and additionally it has implications for the development team as well.
Using "chaos engineering" shifts the mindset of the developers. As a developer
you now expect things to fail. You know that the "we make it work first and
make it resilient later" approach will bite you sooner rather than later, so
you think about resilience from the first line of code.

------
naringas
what about disabling the short advertisement preview played before what I
actually want to watch?

sure I can skip it, but why should I have to?

------
neo01124
Author of the article here.

Please take a look at the underlying library here (AWSSSMChaosRunner) -
[https://github.com/amzn/awsssmchaosrunner](https://github.com/amzn/awsssmchaosrunner)

------
fatninja
We tried a similar chaos tool in our company, built in-house. We simulated most
of the scenarios mentioned here using SSM and other scripts. At first everyone
was interested, but after some time the interest faded. Our problem was a lack
of visualization across the app ecosystem, i.e., how it would impact the app
ecosystem when a batch of EC2 instances suddenly spiked on CPU, and what the
impact on the end user would be.

Turns out people care only if there is an end user impact and don't really
care about random anomalies.

And building the capabilities required for measuring the impact, plus
automating the workflow of the actual chaos tests, is a lot of work.

~~~
neo01124
Stress testing a whole app ecosystem end-to-end and preventing/mitigating end
user impact is generally a part of "gamedays" -
[https://wa.aws.amazon.com/wat.concept.gameday.en.html](https://wa.aws.amazon.com/wat.concept.gameday.en.html).

A library like AWSSSMChaosRunner would be a core component of building a
gameday-like capability. But building a full gameday framework is out of the
scope of this discussion.

------
steve_gh
You can do Chaos Engineering quite easily with Ruby, because you can raise an
exception in a thread from another thread. Many years ago I built a simple
tool which allowed you to specify a series of exceptions, and their frequency.
The library would run up an extra thread in your process, and simply drop
bombs (i.e. raise exceptions according to the required distribution) across
the other threads at random.

It worked a treat for ensuring high availability in IoT systems

------
_xerces_
It always starts out streaming in really, really low quality on the Firestick
and gets stuck there despite us having gigabit fiber. We usually have to stop
playing, exit, and then go back in again; then it works and plays in HD. It
never happens on Netflix or Hulu or any other app, just Prime Video.

------
raverbashing
One thing I wonder about (and find difficult to simulate) is failures in
external services. Sure, you can unit test your function with a 500, for
example, but you never know in which ways the function/library can fail.

(Not to mention the cases where it says everything worked but it didn't)

~~~
neo01124
The article does talk about how to inject latency or packet-loss into calls
to particular external services. This should help you test many
service->dependency failure scenarios around retries, timeouts and circuit-
breakers.

Injecting specific error codes or exceptions is a bit more complicated but it
is possible with other approaches, for example: Chaos toolkit.
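
For a rough idea of what injecting specific exceptions can look like at the application level, here is a small in-process sketch. It is not how AWSSSMChaosRunner or Chaos Toolkit work; the decorator, its parameters, and the dependency function are all hypothetical names chosen for illustration.

```python
import functools
import random
import time


def inject_faults(latency_s=0.0, error_rate=0.0, exc=ConnectionError, rng=None):
    """Wrap a dependency call with optional delay and random failures.

    latency_s:  extra delay added to every call (simulated slow network).
    error_rate: probability in [0, 1] that a call raises `exc` instead.
    rng:        optional random.Random, so experiments can be seeded.
    """
    rng = rng or random.Random()

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if latency_s:
                time.sleep(latency_s)
            if rng.random() < error_rate:
                raise exc("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retries(fn, attempts=3):
    """Retry `fn` on ConnectionError, giving up after `attempts` tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


@inject_faults(latency_s=0.01, error_rate=0.5, rng=random.Random(0))
def fetch_profile():
    """Stand-in for a call to an external service."""
    return {"user": "example"}
```

A test can then call `call_with_retries(fetch_profile)` and check that the caller's retry and error handling behave as intended when the dependency is flaky.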

------
what_ever
I have been getting a "failed to load" error almost every week on Prime Video
when I try to load a series at dinner time on weekdays (Pacific time zone). I
never have this issue with Netflix.

------
Havoc
>The key to chaos engineering is injecting failure in a controlled manner.

Doesn’t that sorta defeat the point of “chaos” a bit?

~~~
neo01124
That is more about the "engineering" bit.

------
swayamraina
This mentions it cannot be used against AWS lambda. Not sure why?

~~~
neo01124
Hi! Author of the article here.

The AWSSSMChaosRunner approach can't be used for Lambda because of what @vasco
said.

You can take a look at a different approach here for failure injection in
Lambda -
[https://medium.com/@adhorn/failure-injection-gain-confidence...](https://medium.com/@adhorn/failure-injection-gain-confidence-in-your-serverless-application-ce6c0060f586)

------
rootedbox
If I use any Prime Video apps.. they think I'm in Canada (which I'm not).. and
only give me Canadian selections (which are horrible)..

if I use prime video in a browser.. it works fine..

I have no idea why it does this and support has no clue.

------
chromedev
Just remember anything you purchase on Amazon Prime, you don't technically own
it.

~~~
nkristoffersen
From my understanding you don’t technically own most media (music, movies,
software) regardless of the format. It’s a license. Even if you purchase a CD
or DVD, etc.

~~~
kortilla
That misses the point. I have DVDs that I purchased that will work as long as
I possess them and they were made before Netflix was even a company.

There may technically be a license attached to them, but there is no practical
way for any company to revoke my usage of them and the failure mode if any of
those companies cease to exist is for them to continue working.

~~~
paxys
Isn't it exactly the same for a song or movie you purchased from Amazon and
downloaded to your hard drive?

~~~
chrisco255
Can you download prime movies to disk? I thought it was streamed only.

