> But it also says it was never intended or designed to run at high scale.
Doesn't this statement itself indicate bad design? I mean, it's Amazon Prime Video. If you design a system for Prime Video, you should design it to run at high scale.
No. This is one component of Prime Video that is no longer a microservice, which indicates they still heavily use microservices.
They needed to solve a problem. It was probably a low-priority thing they wanted; they didn't need it to scale or be heavily used. They designed and built it quickly.
It outgrew the architecture.
But during that time it solved the problem and made them money.
The requirement of software is to solve a problem. A second requirement is often to make money. It ticked both boxes.
That does not mean it's bad design. This is something junior and intermediate programmers forget: something does not need to be designed perfectly on day one. It needs to work.
> A second requirement is often to make money. It ticked both boxes.
In this case it caused 10x higher AWS bills compared to a proper design, so I'm not sure it ticked the second box.
If you know that your product will absolutely outgrow the architecture soon (again, we are talking about Prime Video), it makes more sense to design it properly from the beginning.
> They designed and built it quickly.
> That something does not need to be designed perfectly on day one. It needs to work.
But they tried to design it perfectly on day one, with the wrong assumptions. A monolith would have been easier to design and quicker to implement. Again, I am speaking about this special case (because of the scale of the company/product). This is not premature optimisation; it is common sense.
They decided to go a step further and decode locally, eliminating the S3 bucket entirely; so maybe not, as my change still has that (wasteful) step.
But the questions “can we group these operations together to save on network?” and “will we hit API limits due to fanout?” are ones I’ve had (in other contexts) during design reviews. I think it’s weird they missed that initially.
Fanning a video out per frame and per operation seems inherently problematic; batching seems the obvious answer. Design review is meant to address exactly that kind of concern. The design, as described:
- didn’t just grab the whole video (or ~1min segment), but individual frames
- put each detector in its own context, needing its own copy
- created a StepFunction for each tiny piece
This never would’ve passed design review when I worked at Amazon because the combinatorial blowout is obvious and predictable.
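To make the blowout concrete, here's a back-of-envelope sketch; all the numbers (frame rate, detector count, video length) are my assumptions, not figures from the article:

```python
# Rough fanout math for a hypothetical one-hour video.
FPS = 30                 # assumed frame rate
VIDEO_SECONDS = 60 * 60  # one-hour video
DETECTORS = 5            # assumed number of defect detectors
SEGMENT_SECONDS = 60     # ~1 min segments, as mentioned above

frames = FPS * VIDEO_SECONDS                          # 108,000 frames
per_frame_fanout = frames * DETECTORS                 # one task per frame per detector
per_segment_batch = VIDEO_SECONDS // SEGMENT_SECONDS  # one batched task per segment

print(f"per-frame fanout:  {per_frame_fanout:,} tasks")   # 540,000
print(f"per-segment batch: {per_segment_batch:,} tasks")  # 60
```

Even with generous assumptions, that's roughly four orders of magnitude more orchestration steps, state transitions, and S3 round-trips per video.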
I’m sure they switched to what was considered good practice even then:
- a single StepFunction per video
- a batch step across segments
- downloading in bulk, to a single container
- running all the detectors together
The problem wasn’t serverless or microservices: this was just bad design.
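For illustration, a minimal Python sketch of that batched shape; every name here (download_segments, decode_frames, the detectors) is a hypothetical stub, not Amazon's actual code or any AWS API:

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-ins; a real pipeline would pull segments from storage
# and decode them with something like ffmpeg.
def download_segments(video_id: str) -> Iterable[bytes]:
    yield from []  # placeholder: ~1 min video segments, downloaded in bulk

def decode_frames(segment: bytes) -> List[bytes]:
    return []      # placeholder: raw frames, decoded once, locally

# Assumed detector names, loosely based on the defects the article mentions.
DETECTORS: Dict[str, Callable[[List[bytes]], List[dict]]] = {
    "block_corruption": lambda frames: [],
    "av_sync":          lambda frames: [],
}

def analyze_video(video_id: str) -> Dict[str, List[dict]]:
    # One pass per video: download in bulk, decode each segment once, and
    # run every detector against the same in-memory frames. No per-frame
    # fanout, no per-detector copies, no S3 round-trips between steps.
    results: Dict[str, List[dict]] = {name: [] for name in DETECTORS}
    for segment in download_segments(video_id):
        frames = decode_frames(segment)
        for name, detect in DETECTORS.items():
            results[name].extend(detect(frames))
    return results
```

The point of the structure is that each frame is fetched and decoded exactly once and every detector reads the same copy, so the work grows with video length rather than with frames × detectors.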