Hacker News
Prime Video service dumps microservices, cuts AWS bill 90% (thestack.technology)
140 points by msolujic on May 7, 2023 | 41 comments



Here’s a link to the actual blog post from Amazon: https://www.primevideotech.com/video-streaming/scaling-up-th...

It’s worth noting that the team that wrote the article is just a small part of Prime Video. The headline can definitely be interpreted to mean the entire Prime Video service.


I feel that the term "monolith" being in the title has led everyone to draw the wrong conclusions.



I ended up having a weird debate on Reddit over this. For some reason people seem to think that there is a single better choice, and that if you ever go from a monolith to microservices, the monolith must have failed, or whatever.

One person even claimed that microservices are easier to maintain and debug, even though a single codebase that you can run tooling against and easily keep in sync seems miles easier to maintain than microservices, where you have to version your interfaces so older services keep working while you roll out newer versions.

Then they argued that with a monolith it takes longer to deploy. I assumed they meant the actual deploy tasks took longer, which I can kind of see. But it turns out they were talking about how often they deploy.

It seems nuts to me that people think switching from a monolith to microservices means the poor practices that got their monolith into a bad state will suddenly disappear. In my experience, it generally just means they build a distributed monolith and keep all the same problems, plus some extra ones.

I'm of the opinion that if you can't develop a good monolithic system you can't develop a good microservices system. A good microservices system, while it has its benefits, is, in my opinion, harder to architect, develop, and maintain than a monolithic one.


Funnily enough, both things can be true depending on the context.

Microservices are easier to debug: sure, if you compare a single monolithic binary with no logs or trace points to a microservice system running on top of a fancy trace-everything service mesh. Logging is still hard either way.

Deployment time: the same reason people wouldn't compile dependencies statically. It's easier to replace a .dll or .so than the entire application, and it takes less bandwidth. However, mess up the versions and you end up having a really bad day. Same with microservices: sure, a tiny service is easy to deploy, but heaven forbid you deploy incompatible versions.

etc etc.

Microservices give people who don't have an architectural view/experience in building complex systems from scratch a framework to put their stuff in without thinking much about it. Good luck getting the same level of organization in a monolithic NodeJS application. It's hard; even seasoned veterans often get the abstraction layers wrong.


I think part of the problem is that people who are wildly enthusiastic about microservices haven't experienced DLL Hell and don't understand the problems that can happen.


No argument here, I personally don't think microservices are the solution to code organization problems. However, many ecosystems don't have a good structure that lets less experienced coders work efficiently on large-scale systems (NodeJS, Go, etc.)


> Deployment time: the same reason people wouldn't compile dependencies statically. It's easier to replace a .dll or .so than the entire application, and it takes less bandwidth. However, mess up the versions and you end up having a really bad day. Same with microservices: sure, a tiny service is easy to deploy, but heaven forbid you deploy incompatible versions.

That argument I could understand. But their argument turned out to be that they thought a monolith meant everyone had to spend weeks between deployments.

So it wasn't how long it took to do a deployment but how often they could do deployments, which is entirely a process matter. With microservices, in my experience, it can turn into a mess of "we need to sync our deployments" because they've built a distributed monolith.

To be honest, most of their arguments seemed to boil down to having tons of technical debt and blaming the monolith for that debt existing, rather than the debt itself.

> Microservices give people who don't have an architectural view/experience in building complex systems from scratch a framework to put their stuff in without thinking much about it. Good luck getting the same level of organization in a monolithic NodeJS application. It's hard, even seasoned veterans often get the abstraction layers wrong.

If you don't have the discipline to keep things organised in a monolith, how are you going to manage with microservices? Especially when there are literally tools to help you do this in monoliths. This is how people end up with distributed monoliths.

Is building a high-quality monolith hard? Yes. Is it harder than building a high-quality microservices system? I would say no.


Monoliths work at local CPU L1-cache speeds; distributed microservices have JSON serialization, deserialization, and network round trips.

Distributed stateless monoliths can be faster than microservices.

The carbon footprint and cloud bill of too many microservices are probably high.
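
To put a rough number on the serialization tax alone, here's a minimal Python sketch (made-up payload, no network at all) comparing a plain in-process call with a call that has to round-trip through JSON:

    import json, timeit

    # Hypothetical payload, roughly the shape of a per-frame analysis request.
    payload = {"frame_id": 12345, "timestamps": list(range(100)), "labels": ["blur", "block"]}

    def local_call(p):
        # In-process: just a function call, the data never leaves memory.
        return p["frame_id"]

    def rpc_shaped_call(p):
        # Microservice-shaped: serialize, (imagine a network hop here), deserialize.
        return json.loads(json.dumps(p))["frame_id"]

    n = 100_000
    print("direct call     :", timeit.timeit(lambda: local_call(payload), number=n))
    print("json round trip :", timeit.timeit(lambda: rpc_shaped_call(payload), number=n))
    # The JSON round trip is typically orders of magnitude slower than the direct call,
    # and that's before adding any real network latency.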


> Monoliths work at local CPU L1-cache speeds

This becomes particularly differentiating when you start looking at batching primitives. Getting that pipeline filled up can add another 10x+ to performance.

Stacking all of the multipliers in your favor, things do begin to look pretty silly for distributed crap: moving from a network trip in the same datacenter to an L1 reference is about a million times faster. In pretty direct terms: you could hypothetically claim your app is ~6 orders of magnitude faster than some competing solution if you do everything right and you aren't saturated on I/O.
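
Those are the commonly cited "latency numbers": an L1 reference around 0.5 ns versus a round trip within the same datacenter around 500 µs, which is where the ~6 orders of magnitude come from. A trivial check:

    # Commonly cited figures; actual hardware and networks vary.
    L1_REFERENCE_NS = 0.5
    SAME_DC_ROUND_TRIP_NS = 500_000                  # ~0.5 ms inside one datacenter
    print(SAME_DC_ROUND_TRIP_NS / L1_REFERENCE_NS)   # 1,000,000 -- about 6 orders of magnitude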

All of this said, I am feverishly pushing most of our B2B product stack into "serverless" functions. We don't need 6+ orders of magnitude performance improvement for our business. We need compliance, standardization and stability above all else. We want to be able to point our fingers at someone else and say "you fix it".

Running everything in 1 process is something I have been very enthusiastic about so far, but it does have its downsides if raw performance (or simplicity) is not your primary objective. We are trying to sell our product up-market, and without the ability to say things like "our production database is N+3/multi-zone, uses TDE and the keys are stored in an HSM on Mars", we won't get much attention from those customers.


This is especially true when you've got yourself a distributed monolith architecture.

However, the case for a monolith architecture to make a favourable comeback based on factors such as performance and operational cost is negated when you have multiple teams working on a single monolith. It quickly becomes a nightmare to manage all those code changes, and then you have to manage releases, etc.

Most companies are doing microservices wrong, and it gets expensive fast. That's not to say microservices are the inferior choice, because for most large orgs they are not.

Are we really going to see this debate pop up again, just because Amazon did something?


> distributed microservices have JSON serialization, deserialization,

It doesn't have to be JSON; protobuf is 10x faster. Still slower than in-memory/CPU cache, of course.
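
Not protobuf itself, but a rough stand-in comparison (Python's built-in json versus a fixed binary layout via struct) gives a feel for why a compact binary wire format parses faster than text, while both remain slow next to data already in cache:

    import json, struct, timeit

    # Hypothetical record; field names are made up.
    record = {"frame_id": 12345, "defect_score": 0.25}

    def json_roundtrip():
        return json.loads(json.dumps(record))

    def binary_roundtrip():
        # Fixed little-endian layout: one unsigned int, one double (protobuf-ish in spirit).
        frame_id, score = struct.unpack("<Id", struct.pack("<Id", record["frame_id"], record["defect_score"]))
        return frame_id, score

    n = 200_000
    print("json   :", timeit.timeit(json_roundtrip, number=n))
    print("binary :", timeit.timeit(binary_roundtrip, number=n))
    # The binary round trip is usually several times faster, but both are still glacial
    # compared to reading the same values straight out of CPU cache.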


Carbon footprint vs “engineering time costs more than computing time”


Microservices never decrease engineering time costs.

Project manager overhead - maybe, but ideally that should be zero anyways.


This sounds like a bad first design:

- didn’t grab the whole video (or ~1 min segments), but individual frames

- put each detector in its own context, needing its own copy

- created a StepFunction for each tiny piece

This never would’ve passed design review when I worked at Amazon because the combinatorial blowout is obvious and predictable.

I’m sure they switched to what was considered good practice even then:

- a single StepFunction per video

- a batch step across segments

- downloading in bulk, to a single container

- running all the detectors together

The problem wasn’t serverless or microservices: this was just bad design.
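
For the "good practice" shape, a rough sketch of what a single state machine per video could look like in Amazon States Language, written here as a Python dict; all state names, ARNs, and fields are made up for illustration, not the actual Prime Video design:

    # Hypothetical single state machine per video; parameters, retries, etc. omitted.
    detect_defects_per_video = {
        "StartAt": "ListSegments",
        "States": {
            "ListSegments": {                    # one cheap step: enumerate ~1 min segments, not frames
                "Type": "Task",
                "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:ListSegments",
                "Next": "AnalyzeSegments",
            },
            "AnalyzeSegments": {                 # fan out per segment, not per frame and per detector
                "Type": "Map",
                "ItemsPath": "$.segments",
                "MaxConcurrency": 10,
                "Iterator": {
                    "StartAt": "RunAllDetectors",
                    "States": {
                        "RunAllDetectors": {     # one container downloads its segment once and
                            "Type": "Task",      # runs every detector locally, no per-frame S3 I/O
                            "Resource": "arn:aws:states:::ecs:runTask.sync",
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
        },
    }

The point of the shape: the fan-out is bounded by segment count, not frame count times detector count.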


This assumes the current functionality is the same as it was when it was originally created.


The article from Prime Video says that it has largely the same functionality, while outlining problems from the initial version.


But it also says it was never intended or designed to run at high scale.

They built something that solved the problem at the time. It worked. They outgrew it and evolved the architecture.

Now everyone’s on HN and Twitter claiming to be experts saying they would never do that and they know better.


> But it also says it was never intended or designed to run at high scale.

Doesn't this statement itself indicate bad design? I mean, it's Amazon Prime Video. If you design a system for Prime Video you should design it to run at high scale.


No. This is one component of Prime which is no longer a microservice. Indicating they still heavily use microservices.

They needed to solve a problem. It was probably a low-priority thing they wanted; they didn't need it to scale much or be heavily used. They designed and built it quickly.

It outgrew the architecture.

But during that time it solved the problem and made them money.

The requirement of software is to solve a problem. A second requirement is often to make money. It ticked both boxes.

That does not mean it’s bad design. This is something juniors and intermediate programmers forget. That something does not need to be designed perfectly on day one. It needs to work.


> A second requirement is often to make money. It ticked both boxes.

In this case it caused 10x higher AWS bills when compared to proper design, so I'm not sure if it ticked the second box.

If you know that your product will absolutely outgrow the architecture soon (again, we are talking about Prime Video) it makes more sense to design it properly from the beginning.

> They designed and built it quickly.

> That something does not need to be designed perfectly on day one. It needs to work.

But they tried to design it perfectly on day one, with the wrong assumptions. A monolith would be easier to design and quicker to implement. Again, I am speaking for this special case (because of the scale of the company/product). This is not premature optimisation, it is common sense.


If this was a system design interview, would this design be approved?


I’m guessing about their requirements.

They decided to go the extra step and decode locally to skip the S3 bucket entirely — so maybe not, as my change still has that (wasteful) step.

But the questions “can we group these operations together to save on network?” and “will we hit API limits due to fanout?” are ones I’ve had (in other contexts) during design reviews. I think it’s weird they missed that, initially.

Fanning a video out per frame and per operation seems inherently problematic; batching seems the obvious answer. Design review is meant to address that (variety of) concern.


The article's headline and text mention this was achieved by switching from Lambda to ECS, which makes a lot of sense. Lambda is expensive for frequently used services.
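
Back-of-the-envelope, with approximate us-east-1 list prices (these drift, so treat the dollars as illustrative only), the gap for an always-busy workload looks roughly like this:

    # Very rough, illustrative math; approximate us-east-1 list prices, check current pricing.
    LAMBDA_PER_GB_SECOND = 0.0000166667      # Lambda compute, $/GB-second
    FARGATE_PER_VCPU_HOUR = 0.04048          # Fargate, $/vCPU-hour
    FARGATE_PER_GB_HOUR = 0.004445           # Fargate, $/GB-hour

    HOURS_PER_MONTH = 730
    SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600

    # A service that is busy essentially 24/7. Lambda gives ~1 vCPU at 1,769 MB of memory.
    lambda_monthly = 1.769 * SECONDS_PER_MONTH * LAMBDA_PER_GB_SECOND
    fargate_monthly = (1 * FARGATE_PER_VCPU_HOUR + 2 * FARGATE_PER_GB_HOUR) * HOURS_PER_MONTH

    print(f"Lambda, always busy    : ~${lambda_monthly:.0f}/month")   # roughly $77
    print(f"Fargate, 1 vCPU / 2 GB : ~${fargate_monthly:.0f}/month")  # roughly $36; EC2 cheaper still
    # If the workload is idle most of the time, the comparison flips, because Lambda
    # bills nothing between invocations.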


If I recall, one of the biggest selling points of Lambda (serverless) was the “infinite” scale. Sounds like the “cash cow” for cloud service providers when used inefficiently/inappropriately.

That, and the lower maintenance effort.


> If I recall, one of the biggest selling points of Lambda (serverless) was the “infinite” scale

From zero to "infinite", so very useful for highly variable or completely unknown workloads. Especially with the container version, migrating away isn't a massive undertaking, so it's still very useful as a starting point.


That's true. I think the example from AWS is some once-a-quarter data ingestion being much, much cheaper on Lambda, and of course if you have your own build machines then you are limited in how often you can build without having people waiting.

But if you want low maintenance efforts, then you can get managed Kubernetes.


Sorry, but managed Kubernetes is not low maintenance like FaaS. I had a Kubernetes upgrade fail, and it was not fun to stand up a new cluster, install newer versions of the ingress that could no longer be upgraded in place, and migrate all the workloads. The k8s cluster is still very much your problem.


This is a weird article. They moved from Lambdas stitched together with Step Functions to having their video monitoring service run as its own program. Since some other service is serving video, handling billing, and all the other things Prime Video has to do, there is no way that could be considered a monolith.

They've gone from serverless to a microservice.

(Which is interesting, but probably doesn't get the clicks.)


As I recall, https://icloudguru.com/ used serverless to serve their video lessons and said they were paying pennies.

I wonder what they're doing differently. I tried searching for the original article, but it's too ambiguous with their general offerings, so no luck.


Just serve and not transcode? Because it sounds like the Amazon Prime microservice was doing some heavy processing of uploaded video - breaking it into frames and applying ML algorithms to them, and the same with the audio. It sounds like, even if you can fit under the 15-minute Lambda execution timeout (assuming internal teams are limited to the same), AWS Lambda/Step Functions isn't a generic job queue handler and is not infinitely parallel like AWS would want you to believe.

It's a bit like using an SQL database for streamed append-only logging. It'll work in test, with test amounts of data, but doesn't actually scale.


Writing every frame to S3 and reading them back in other steps was probably a major factor.


There is not much of an option with Step Functions: inputs/outputs can be at most 256 KB, so especially if you go into or out of a Map/Parallel state you can't do much more than pass references to the data rather than the data itself.
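
In other words, the state payload can only carry pointers. A small sketch of what actually flows between states (bucket and field names are made up):

    # What flows between Step Functions states stays tiny: references, not data.
    state_output = {
        "videoId": "v-123",
        "segments": [
            {"s3Uri": "s3://defect-detection-bucket/v-123/seg-000.ts"},
            {"s3Uri": "s3://defect-detection-bucket/v-123/seg-001.ts"},
        ],
    }
    # Passing decoded frames themselves would blow past the 256 KB state payload limit,
    # so every step ends up doing its own S3 download/upload -- the traffic the comment
    # above points to as a major cost factor.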


Sure but I guess I mean the nature of that task probably made them the wrong architectural choice.


Serverless isn't "cheap". It's rightsized. If you aren't serving 24/7, it makes sense to move to an on-demand model. However, if you compare the unit price of compute, Lambda is comparable to other services, and likely more expensive than buying compute wholesale. You can just buy it in smaller portions.


>I wonder what they're doing differently

The largest thing that stands out to me is simply scale.


A service should solve an entire problem and minimize the number of horizontal network calls. Serverless providers want you to split things into single tasks, but the overhead cost per request is high. A task is not a service.


Don't buy retail, buy wholesale, unless you are buying a small amount.


CPUs are mind-bogglingly fast compared to data transfer. There are exceptions, but for the vast majority of workloads reducing data transfers is the easiest optimization.

But scale is the obvious key component; having CPUs allocated to doing mostly nothing is such a waste.


Well, except that's not necessarily an "easy" optimization. If you go all-out with the highly managed stuff it's kind of like plugging together very simple Legos.


Agree, "easy" was a poor choice of word. Perhaps "powerful" would be better, because frequently it can be difficult to attain. But it is difficult to pick good nuances for broad generalizations. There are many other attractive aspects of microservices and many cases for which they are perfectly well suited.



