My word. I'm sort of gobsmacked this article exists.
I know there are nuances in the article, but my first impression was it's saying "we went back to basics and stopped using needless expensive AWS stuff that caused us to completely over-architect our application and the results were much better". Which is a good lesson, and a good story, but there's a kind of irony it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.
There was an article not long ago from AWS saying they'll be focussing on cutting cost for customers. Maybe the next step of that process will be pushing their clients off of AWS and telling them to just host on prem.
I know you're joking around, but no, as they also explained a benefit of cloud (and therefore using AWS) is that it can scale flexibly with their customers' businesses.
If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.
With cloud if demand drops you can scale up and down as needed. Helping customers cut costs during difficult times makes sense since those customers are more likely to survive and stay with you through good times.
So in context I think this article makes sense since long-term sustainable growth of AWS should be linked with the growth of their customers' businesses.
> If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.
Cloud vendors also mostly sell minimum use packages for discounts in the range of 20 to 80% (called e.g. "committed use discount" or "compute savings plan"). Lots of businesses use those, because two-digit discounts are real money, but they might find themselves in the same spot as with physical hardware they don't need...
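To put a rough number on that commitment risk, here is a minimal sketch. It assumes a deliberately simple model in which the discounted rate is paid for the whole committed capacity whether it is used or not. The function names, the list price, and the unit counts are made up for illustration; real savings plans have more moving parts.

```python
def committed_cost(committed_units: float, on_demand_price: float, discount: float) -> float:
    """Monthly bill for a commitment: paid in full regardless of actual usage."""
    return committed_units * on_demand_price * (1.0 - discount)


def on_demand_cost(used_units: float, on_demand_price: float) -> float:
    """Monthly bill when paying list price only for what is actually used."""
    return used_units * on_demand_price


def break_even_utilization(discount: float) -> float:
    """Fraction of the commitment that must be used for it to match on-demand cost."""
    return 1.0 - discount


if __name__ == "__main__":
    price = 100.0   # hypothetical list price per unit per month
    committed = 10  # hypothetical number of units committed for the term
    for discount in (0.2, 0.5, 0.8):
        for used in (10, 5):
            c = committed_cost(committed, price, discount)
            o = on_demand_cost(used, price)
            print(f"discount={discount:.0%} used={used}: committed=${c:.0f} on-demand=${o:.0f}")
```

Under this model a 20% discount only pays off if more than 80% of the commitment is actually used. If demand halves, plain on-demand pricing would have been cheaper.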
I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.
It also means some poor fuck at AWS gets woken up in the middle of the night instead of me when things go to shit.
It absolutely comes at a cost, and might not be the right fit for an organisation that's absolutely on top of its hardware requirements and can afford to divert resources from new development work. For the rest of us it saves a lot of dev hours that would have otherwise been spent in pointless meetings or debating the best implementation of whatever half-baked stack has oozed its way out of the organisation in an attempt to replicate what's handed to you with a cloud solution.
> I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.
And endless orgies of "call for pricing" with hardware vendors and hosting. Shitty websites where you can buy preconfigured servers somewhat cheaply, or vendor websites where you can configure everything but overpay. Useless sales-droids trying to "value-add" stuff on top.
Cloud buys are a lot friendlier, because you only have the one cloud vendor to worry about. Entry level you just pay list price by clicking a button. If you buy a lot, you are big enough to have your own business people to hammer out a rebate on list price, still very easy, still very simple. But overall still more expensive unfortunately.
> I'm a cloud proponent because it means not having to sit through hours of meetings to deploy a $5/mo virtual machine.
I'd hope there aren't actually hours of meetings for a single $5/mo VM?
But I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.
Spend is spend, it's vital to understand what is being spent on what and why.
> I'd hope there aren't actually hours of meetings for a single $5/mo VM?
Slightly exaggerated in the case of the $5 machine, probably 2-3 man-hours total, but it took 4 days for it to be deployed instead of ~5 minutes. We did spend tens of hours justifying why the business should spend ~$100 more per month on a production system where the metrics clearly indicated that it was resource constrained.
The same IT department that demanded we justify every penny spent did not apply any of that rigour to their own spending. Control over the deployment of resources was used as a political tool to increase their headcount.
> I would hope there are reviews and meetings when deploying enough of these to amount to real money. Companies that don't do that soon enough find themselves with a million dollar AWS bill without understanding what's going on.
I consider the judicious use of resources to be part of my job as a software engineer. A development team that isn't considering how they can reduce spend, tidy up, or right-size their resources is a massive red flag to me. Organisations frequently shoot themselves in the foot by shifting that responsibility away from the development team. The result is usually factional infighting and more meetings.
It's not really the same spot in that you're paying monthly rather than upfront. Devs tend to think about total $, the business/accountants do care about Opex vs Capex.
Also it's going to be simpler to provision your base (committed use) on the cloud and then handle bursts on the cloud, than it is to have your base on prem and burst to the cloud.
> It's not really the same spot in that you're paying monthly rather than upfront. Devs tend to think about total $, the business/accountants do care about Opex vs Capex.
You can lease physical servers, turning the cost into opex.
You can also rent them for little bit extra via managed dedicated servers from vendors like OVH.
Another point I've also seen used to lie about cloud cost is saying you save so much on engineers.
...while forgetting that to have a sane on-call rotation for cloud you also need at least 3 people on that rotation who are clued in enough on cloud operations. Sure, they can be "developers", but if your app architecture requires so little maintenance and flea removal that they are not doing ops jobs much, chances are it would need just as little in either a rented or dedicated server env.
That is not really a difference, you may as well lease your server farm in the basement, practically the same cost as buying it, just as a monthly payment with the supposed "advantages" the business people might care about.
> If your business invests in physical servers anticipating strong growth next year then later finds out actually we're going into a recession and those servers are no longer needed, then that's a sunk cost.
Yes, but that sunk cost is probably still lower than what you paid AWS for the option to scale up and down.
This. And I think people tend not to understand how little actual hardware they are paying for when using AWS et al.
A really cheap server leasing deal will cost you yearly about as much as the purchase price of the server. With opaque AWS services it is probably more like a month of subscription to pay for the hardware that you are indirectly using.
I worked for a global company that maintained its own "cloud" of VMs that we'd use for development purposes.
They were entirely unusable.
Opening a relatively small file in notepad could take multiple minutes. OS click and typing response times were measured in seconds.
Despite wasting thousands of developer hours each year, they refused to upgrade their data center. Probably because doing so would have been a major budget fight that requires an executive to actually advocate for something instead of making their characteristic animalistic grunts of agreement.
For better or worse I haven't seen the same issue with cloud expenditure. It seems to be perceived as a necessary expense, rather than the engineering department getting ideas above their station.
I just spent the better part of two years advocating, pushing, and fighting for months to add new bandwidth to our datacenter.
Thankfully after they understood the problem it only took 8 months of procurement, techs going to the data center 10+ times with endless screw ups, and everyone pointing the finger at each other.
While the cloud sucks in many ways the traditional setup has big problems as soon as you hit a midsize company ime.
A cloud vendor (who will be nameless as I signed an NDA specifically that prevents me from disparaging them; but one of the big three) ran out of capacity for me and it was 3 months before they managed to fix it. -- that was with a couple million a month in spend.
Cloud is still servers; you just depend on someone else's capacity management skills and you hope that there isn't a rush to populate a location (like when a region goes down and everyone's auto-provisioners move regions to yours).
Barring exceptional circumstances, I don't have to fight that fight at the cloud provider though. Their business is more likely to be amenable to maintaining and expanding reasonable levels of capacity.
I have to deal with a grumpy finance guy that thinks my whole department is overpaid already, especially so if we might use the dreaded `CapEx` word.
I think the main point here is that there is no limit to incompetence. And sure, having your own servers allows for some goofs that won't happen with cloud (the opposite is also true). But your org had the means to fix the issue, and they chose not to. That has fundamentally got nothing to do with technology choice.
Or maybe there's a mid point. It's not datacenter or cloud. There are providers offering physical servers for rent for example. Lots of combinations in-between.
And that you can also lease servers directly from vendors like OVH so you don't even need to bother with the "drive to datacenter and install it" part. It's more expensive but still far cheaper than cloud
Most companies will pay way more for the engineers maintaining their on-premises infrastructure than they would for AWS. On-premises still makes sense when you reach a certain scale. When you reach a certain scale.
They kind of need to be there anyway; physically maintaining servers turns out to be a minuscule part of the whole maintenance. If you really care about uptime you still need people on-call who can intervene as necessary.
It's not a minuscule part of a small company. I made the point that on-premises makes sense after a certain scale.
Once you have on-premises you need people that know switches, routers, rackmount server, hardware, virtualization, etc, plus keeping all of that properly maintained (security patches, IaC, periodic updates, analyzing performance, making sure it's properly architected, etc).
I often see people saying it's the same cost or less but it's really not. Unless you have no idea what you should be doing.
I don't know, I worked at a few companies that did this early in my career (early 2000s), and it was just the devs or the sysadmin of the office IT that did this sort of thing. There are lots of people who know enough about switches and routers to get them up and running.
Virtualization, IaC, analyzing performance, right architecture etc is all for later, when you've grown enough to need that.
> Virtualization, IaC, analyzing performance, right architecture etc is all for later, when you've grown enough to need that.
Yeah, I think it might be a different perspective about when that all should be done.
I tend to do that right from the beginning because I often see it snowball later on and nobody ever fixes it or does it "properly" (in my opinion, possibly not the right one).
Using insurance to cover unexpected costs is always a gamble, one way or another. A business that invests in physical servers could sell off those servers if they later find out that there is a recession, which might cost more or less compared to a cloud solution.
If a business invests in cloud infrastructure and signs a binding contract for 5 years, only to find out that they actually want to abandon that project a year later, that is also a sunk cost. Long-term contracts tend to be cheaper, so it's a trade-off between saving money and taking on risk.
It all depends on the risk analysis, how risk averse one wants to be, and the economics/liquidity needs.
It depends on your timing. If you're extremely unlucky, you'll buy the new set of servers and the recession will hit right after you sign the PO. Probability says you're not likely gonna be that unlucky, so the recession will probably hit elsewhere in the physical servers' life cycle. A recession hits, and now there's a focus on cutting costs. With AWS, you don't have much choice - if you stop paying the bill, the servers evaporate into the cloud. Physical servers don't. You can change their replacement schedule and just wait a few more years to replace them. Hopefully the recession has passed by then and you can buy a whole new pile of servers.
Really though, it seems like a hybrid on-prem/cloud approach is one to consider. Software like Anthos eases this, though there are pitfalls with this approach too.
We have some DB servers that occasionally need to do very large batches of transactions. They run all month with a couple of CPUs and a small amount of RAM to make sure they are 'caught up' with production, and before the batches are run, they get shut down and changed to 32 or 64 CPU monsters. An hour or two later, they go back to being 2 CPU servers again. In a non-cloud shop, we would have to size our hardware for that maximum batch size.
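A back-of-the-envelope sketch of why that resize trick matters, assuming simple per-hour billing. The hourly rates and the two-hour batch window below are hypothetical placeholders, not real prices, and the comparison is between two cloud sizing strategies only; it does not model the purchase cost of on-prem hardware sized for the peak.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def resize_for_batch_cost(small_rate: float, large_rate: float, batch_hours: float) -> float:
    """Run the small instance all month, except for batch_hours spent on the large one."""
    return small_rate * (HOURS_PER_MONTH - batch_hours) + large_rate * batch_hours


def sized_for_peak_cost(large_rate: float) -> float:
    """Run the large instance all month so the batch always fits."""
    return large_rate * HOURS_PER_MONTH


if __name__ == "__main__":
    small, large = 0.10, 3.00  # hypothetical $/hour for the 2-CPU and 64-CPU sizes
    print(f"resize for batch: ${resize_for_batch_cost(small, large, 2):.2f}/month")
    print(f"sized for peak:   ${sized_for_peak_cost(large):.2f}/month")
```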
To be fair to AWS, they do work really hard (at least at an account level) to optimize workloads with you. They do this so overall you'll move more workloads to them.
It's quite simple: if workload x can be done 100% cheaper on-prem then it's an obvious move (probably). If AWS manages to get that closer to 30-40%, then the operational benefits of using AWS make more sense: more workloads, more total spend.
An easily tested compatible upgrade that gets a free performance boost… vs lots of engineering effort to rewrite… yeah that’s just not going to fly with management. Who are probably looking at the 25% performance boost as a 20% cost reduction not a 20% speed increase
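The two percentages in that comment describe the same change from different angles. A minimal sketch of the arithmetic, assuming cost is proportional to compute time for a fixed amount of work (the function name is made up for illustration):

```python
def cost_reduction_from_speedup(speedup: float) -> float:
    """Fractional drop in compute time, and so cost, when throughput rises by `speedup`.

    A speedup of 0.25 means the same work finishes in 1 / 1.25 of the original time.
    """
    return 1.0 - 1.0 / (1.0 + speedup)


if __name__ == "__main__":
    print(f"{cost_reduction_from_speedup(0.25):.0%} less compute time for a 25% speedup")
```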
The only problem with that is that Docker lambdas boot slower than lambdas with the built in runtime (not ridiculously slow, but could be 2x or something). God help you anyway if you’re trying to do something latency sensitive on Lambda, but if you are then you probably don’t want to add more time for a docker pull.
The situation does keep changing - AWS does optimize things.
I'm not so sure it's a black/white true/false. Depends on what goes in the docker image. It's something like for larger deployments docker is faster but for small deployments it's the other way.
We've actually observed the opposite at our company. Moving from a Python 3.8 built-in to Docker based changed our response times from about 40ms to 30ms on average.
> Which is a good lesson, and a good story, but there’s a kind of irony it’s come from an internal Amazon team. As another poster commented, I wouldn’t be surprised if it’s taken down at some point.
Why? Using the model they switched to (which uses a different set of AWS services) instead of the model they switched from is a recommendation that the AWS tech advisers that are made available to enterprise customers will make for certain workloads.
Now when they do that, they can also point to this article as additional backing.
Have you had AWS tech advisers advise teams in your company to go with this stack? Because I haven't.
AWS doesn't have an equally distributed interest in selling all of its products. Some AWS products exist because customers need/demand them and others exist because they provide higher margins and tighter lock-in to Amazon: the first type of products are great for customer acquisition, the role of their sales folk is to then convince people using the former to migrate to the latter.
I've never knowingly had an AWS solutions architect recommend something because it would make AWS more money in the short term. The most frequent advice I've seen from them has been on how to give AWS less money by making use of different features, or changing how particular services are being used.
You sound very confident in your estimation of other peoples' motivations and skills. Do you imagine AWS solution architects coming to you directly with "you should pay for this service because it makes more money for AWS"?
I've done the AWS solutions architect associate level cert and I can tell you from first-hand experience that in order to pass the exam you need to memorize a lot of AWS propaganda that was written primarily to optimize AWS profit, not to optimize customer satisfaction. How many of those solution architects take those materials with a grain of salt vs how many of them genuinely believe that crap, I don't know.
I guess it depends on what you mean by "AWS tech advisers" ... I've had two different AWS recommended partners advise us to go to this stack, one of the partners even tried to explain that despite more than doubling in costs we would "make it up in faster development".
Aurora is I think pretty simple to move away from, since it's just fully compatible Postgres or MySQL. We even use a local Postgres for development purposes against an Aurora solution.
Nope. AWS makes it dead simple to move from RDS to Aurora by clicking a button. There's no way to move data from Aurora to RDS short of doing a SQL dump and reloading everything that way. I found this out when my previous employer was looking at moving from RDS to Aurora.
> > Aurora is I think pretty simple to move away from, since it's just fully compatible Postgres or MySQL. We even use a local Postgres for development purposes against an Aurora solution.
> Nope. AWS makes it dead simple to move from RDS to Aurora by clicking a button. There's no way to move data from Aurora to RDS short of doing a SQL dump and reloading everything that way. I found this out when my previous employer was looking at moving from RDS to Aurora.
I got a bit of a chuckle out of this. There's no way to move from Aurora to RDS short of... 2 minutes of actual work and a lot of waiting around due to the limitations of the hardware?
I get that it's not as easy as a literal button click, but this isn't vendor lock in.
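For the Postgres-compatible case, the "short work, long wait" shape of a dump and reload can be sketched roughly as below. This only builds the `pg_dump` and `pg_restore` command lines and estimates the transfer time; the hostnames, user, database name, data size, and throughput are all hypothetical, and a real migration would also need to deal with downtime, extensions, and roles.

```python
import shlex


def dump_command(host: str, user: str, dbname: str, archive_path: str) -> list:
    """pg_dump invocation writing a custom-format archive (usable by pg_restore)."""
    return ["pg_dump", "--host", host, "--username", user, "--dbname", dbname,
            "--format=custom", "--file", archive_path]


def restore_command(host: str, user: str, dbname: str, archive_path: str, jobs: int = 4) -> list:
    """pg_restore invocation loading the archive with several parallel jobs."""
    return ["pg_restore", "--host", host, "--username", user, "--dbname", dbname,
            "--no-owner", "--jobs", str(jobs), archive_path]


def estimated_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Rough wall-clock time to move size_gb at a sustained throughput, one direction."""
    return size_gb * 1024 / throughput_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical endpoints; substitute your own source and target.
    print(shlex.join(dump_command("aurora.example.internal", "app", "appdb", "appdb.dump")))
    print(shlex.join(restore_command("rds.example.internal", "app", "appdb", "appdb.dump")))
    print(f"~{estimated_hours(2048, 100):.1f} h to move 2 TB at 100 MB/s, each way")
```

The commands themselves take a couple of minutes to type. The estimate is where the time goes, and that part is the same regardless of which vendor sits at either end.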
If you can't stream changes and have to take downtime for a migration then you effectively have vendor lock in if you are serious enough about your database.
Physical replication should be something any database can offer, given it's something cross-database migrations used decades ago with no problem.
I'm intentionally ignoring any of the sarcasm in your comment. The time needed for a db dump is always dependent upon the amount of data. This is true regardless of the db software or where it's running.
Sure, but you don’t really expect to swap databases with a huge data store without any downtime, do you? I’m not aware of any technology that makes that easy.
If you were the CEO and your engineer said, "2 minutes of actual work and a lot of waiting around due to the limitations of the hardware", would you interpret that as "2 minutes of engineering time"?
Obviously I'm going to spend a lot more time communicating the details of the situation to the CEO of a company that is paying me, than I'm going to spend communicating in a Hacker News comment. But as it turns out, no amount of communication is going to be effective if people don't bother to read past the first opportunity they see to jump in with a correction, even if that means stopping reading mid-sentence.
In that situation I would guess the one-click tool doesn’t really handle everything you’d need either so I don’t get what the point of the comparison is.
Nor am I but if we go back up the claim was that they were making it much easier to get in than get out, but unless you're telling me that the one-click tool somehow solves all the issues of migrating a large production database that's in use the difference between the two is a minuscule amount of active work.
I'm not sure what I was supposed to take away from your cryptic sentence. Is there a two minute solution to this problem that you are smugly keeping to yourself so you can mock people replying to you?
My guy, read this whole sentence, which has remained unchanged this entire conversation:
"There's no way to move from Aurora to RDS short of... 2 minutes of actual work and *a lot of waiting around due to the limitations of the hardware*?"
You seem to be having trouble getting past the word "and", so I've helpfully italicized the part you've repeatedly missed or ignored.
Now sure, that's a bit vague, but if you want more details it might have been advisable to ask a question rather than simply ignoring half the sentence because you don't understand it and jumping in with a correction.
And honestly, even if it's vague on some details, there's no universe in which "2 minutes and a lot of waiting around" = "2 minutes". Whatever vagueness you might accuse me of, that fact isn't vague.
That is a fair answer! AFAIK, there are two things to consider:
* Aurora does have some vendor lock-in features, if I'm not wrong?
* Moving from Aurora to Postgres will lead to downtime of 2 minutes + unknown waiting, which is not the case when you convert from Postgres to Aurora.
I haven't failed to consider that, you've just failed to read where I explicitly mentioned that.
Consider: is that impracticality caused by Amazon creating vendor lock-in? Or is that impracticality caused by the fact that reading terabytes of data from storage, transferring it over the network, and writing it into storage is inherently slow because of the physical limitations of hardware, no matter what vendor you're using?
It's a bit odd for me to be in the position of defending Amazon here. I genuinely don't like them, don't use them, and generally do think they're guilty of creating a lot of vendor lock-in. But this is legitimately not an example of any of that.
You're really claiming you can read my mind to know what I have and have not considered right now, despite me bringing it up specifically. Alrighty then.
As is, I've no responsibility to show you anything, and you're just making unwarranted assumptions about what I have and haven't considered, based on a pretty selective reading of what I've said.
Yes, still "2 minutes of actual work and a lot of waiting around due to the limitations of the hardware" which is what I said. Perhaps try reading the whole sentence next time?
If you talk about no vendor lock in, and you'd want to take your database then to a competitor, like Google Cloud, Azure, or on-premises, wouldn't you exactly expect to do a SQL dump and reloading everything? To me, the one-click move from RDS to Aurora you describe is a nice shortcut, but it doesn't invalidate that you can still do the former if you wanted to move to the competitor. Vendor lock in seems more that you've architected your application against a system that only exists on AWS, like SQS or S3 (although, I guess, competitors offer compatible APIs for some of those, I'm not entirely read up on the state of things there).
Seems a bit of a dark pattern to have a shortcut to onboard and not offer the same shortcut to offboard.
Just like easy subscribe-online publications that make you spend 2 hours on a call with a rep pushing discounts or whatever before you can cancel the subscription.
I'd characterize this asymmetry (convenient shortcut IN, standard export OUT) as a predictable, transparent, minor annoyance -- not a "dark pattern" representing deceptive or unethical practices.
Huh. I suppose you can. The pricing looks a little opaque though, and the fact that it took 3 hours after I wrote my comment for yours to show up kind of implies it's a bit of an obscure service.
I will also mention that the AWS team we were working with on this didn't mention DMS, and, when directly asked, literally told me there was no easy way to do an Aurora -> RDS migration.
Aurora significantly modifies the internals of the DBs, particularly the storage layers. It also makes large changes to how memory is used for Postgres. Query plans can be quite different than with the vanilla version. Once you tune and create indexes based on Aurora's characteristics it's going to be a pain to retune for the unmodified version. Aurora also introduces nasty bugs that don't exist on the RDS version such as a memory leak I found was periodically restarting our master. The Postgres team produces highly reliable code, but I don't trust Aurora's hacks on top of it.
I feel like it’s an object lesson in using the right solution for a problem. Step functions do not appear to me to be something that you’d use for things that need to be executed multiple times per second.
Having occasionally looked at them for workflow-driven tasks, I'm not sure what the use case for Step Functions is. Unless your workflow is being called once an hour or something, they seem infeasibly expensive for what they offer, and somehow manage to be more complex than just writing some code to model the workflow.
It's a BPM product. If you have a highly regulated workflow that has to be changed a lot by multiple parties, these products start to make sense. The AWS step functions aren't that great in the BPM and workflow automation either, but I imagine AWS just wants to have a first party offering.
I'm pretty happy with the monolith that we run at our business and this seems to validate our decision to stick to that monolith, but I'm also pretty confident that where we use AWS Lambda, serverless is absolutely the right way to go.
For example, I've written a Lambda application to reply to webhook calls and send API calls whenever those come in. It costs maybe $2 per month to run in compute and requests. Would that make more sense to rewrite as a monolith and run on EC2? I really doubt it.
In your example, you compare Lambda against a separate monolith for handling the webhooks, but with a monolith wouldn't the comparison be between lambda and just adding a route and controller (or equivalent) to the monolith?
I'm more thinking of rewriting the application (a bunch of Lambda functions + API gateway + random bits and bobs) as a monolith and running that separately on an EC2 instance (or any other VPS).
In this article, they didn't bolt the serverless architecture onto another existing monolith, but rather rewrote the Step Functions and Lambda functions to be a single ECS task.
But background tasks are a thing. You could add a webhook endpoint to your monolith that writes a background job. Then your background worker (running on the same EC2 instance, because its hardware requirements look pretty low at $2 a month in Lambda) runs the job. $2 a month is now $0 since it’s running next to your monolith on the VM you’re already paying for.
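A minimal in-process sketch of that webhook-plus-background-worker pattern, using only the Python standard library. The `WebhookApp` class, the `event_id` field, and the "sent API call" placeholder are hypothetical. A real monolith would more likely use its framework's job system backed by a persistent queue, since an in-memory queue like this one loses pending jobs on restart.

```python
import queue
import threading


class WebhookApp:
    """Toy monolith: the route handler enqueues work, a worker thread processes it."""

    def __init__(self) -> None:
        self.jobs = queue.Queue()
        self.processed = []
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def handle_webhook(self, payload: dict) -> int:
        """Webhook route: record the job and answer right away with 202 Accepted."""
        self.jobs.put(payload)
        return 202

    def _run(self) -> None:
        """Background worker loop: take jobs in FIFO order until a None sentinel arrives."""
        while True:
            job = self.jobs.get()
            try:
                if job is None:
                    return
                # Placeholder for the outbound API call the webhook should trigger.
                self.processed.append(f"sent API call for event {job['event_id']}")
            finally:
                self.jobs.task_done()

    def shutdown(self) -> None:
        """Ask the worker to stop after draining the queue, then wait for it."""
        self.jobs.put(None)
        self._worker.join()


if __name__ == "__main__":
    app = WebhookApp()
    print([app.handle_webhook({"event_id": i}) for i in range(3)])
    app.shutdown()
    print(app.processed)
```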
It really depends on the quantity & scale though right?
If in my entire estate I have a single shell script that I run - wow lambdas / serverless are amazing.
When I have 200 things that cost $5/mo each to run but fit nicely on a single 8core/32gb ram server.. then this lambda stuff starts to seem crazy expensive right?
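The arithmetic behind that comparison, as a small sketch. The $5/mo per-service figure and the count of 200 come from the comment above; the server price is a hypothetical placeholder, and the model ignores the ops time needed to run the server.

```python
import math


def monthly_total(service_count: int, cost_per_service: float) -> float:
    """Total monthly spend when every small service is billed separately."""
    return service_count * cost_per_service


def break_even_services(server_cost: float, cost_per_service: float) -> int:
    """Number of separately billed services at which one shared server costs the same."""
    return math.ceil(server_cost / cost_per_service)


if __name__ == "__main__":
    server = 150.0  # hypothetical monthly price for an 8-core/32 GB server
    print(f"200 services at $5/mo: ${monthly_total(200, 5.0):.0f}/month")
    print(f"break-even at {break_even_services(server, 5.0)} services for a ${server:.0f}/mo server")
```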
For some Alexa integrations it is neat and convenient, but I went back to hosting such small interfaces as a service on another server. Not an EC2 instance, just another hosted unmanaged server. They are as cheap as they can get right now.
I'm not really interested in the operational overhead that brings for these small services. Cost-wise they might be just about neck-and-neck, but at least I don't need to worry about the server going down, or having outdated software. Lambda gives me scaling, load balancing and redundancy for that $2.
From experience I say that the operational overhead to host on Amazon isn't trivial at all.
I like AWS and I would still recommend it. It saves some work but also creates new stuff to do. Especially if you also want the costs to be manageable. Automatic updates, configuring a firewall + reverse proxy with automatic certificate renewals and your favorite deployment mechanism isn't more complicated or labor intensive than managing a small application with AWS. You need to interface it just like software you run on your server.
One of the services I host needed to be authenticated by IP. Happens. You easily get a static IP on AWS for incoming traffic. No problem and cheap too. Now try to get one for the other direction... Possible too, but maintenance just became at least as labor intensive as hosting your own machine. AWS just has to fit your scenario and I think many people overestimate how comparatively easier it has become to host a server with feasible security today. Chances are your databases would be less public than if you skip the AWS documentation.
I think you might be unintentionally arguing with a strawman, as everyone else here is talking about using monoliths instead of that.
Few people want to administer a bunch of micro services themselves, but running a single service on a box is pretty low effort, even if you duplicate it for fail over/redundancy
By "using monoliths", do you mean bolting all code you write into a single runtime, even if they are not the same service? Because that's not what was in this article. Instead, they took Step Functions and a bunch of Lambda functions, and created a brand new monolith from that.
In software engineering, a monolithic application describes a single-tiered software application in which the user interface and data access code are combined into a single program from a single platform.
"mono" stands for one/alone/singular, so monolithic is kinda defined to be exactly that, yes.
You can still have multiple monoliths, but they wouldn't communicate with each other and would be entirely separate applications.
I think it is fine. There are scenarios where you need distributed and there are scenarios where you don't.
IMO, distributed software is chosen more for the practicalities of development work than for technical reasons.
We all know from the basics that performant software comes from single structures that do not require packing and unpacking data.
But scaling large applications is hard, and it was much more expensive back then.
Now that we overreacted to microservices we will overreact to monoliths again. And we will bounce many more times until AI takes our jobs and does the loop itself.
The cynic in me (so like 93% of me) reads this as "Instead of abandoning AWS altogether, we changed how we use AWS, but most importantly we're still on AWS".
As an ex-AWS senior dude: we never looked at our service stack as a sell at any cost, but as a continuum of service offerings that could be assembled to be anywhere from more cost-optimal at a higher operational burden to (mostly) ops-free at a higher premium. The goal was to provide a Lego kit of power tools and disappear-from-view tools. At least in my org we never tried to upsell or convince customers of architectures that accreted revenues at their expense; we tried to honestly assess their sophistication and desire for ops burden and complexity vs cost savings by building it themselves with the lower level kit. By our measure using AWS brought us business, and we were generally more motivated by customer obsession than by soaking them. I know Andy definitely had that view and drilled it into our collective heads. In many ways as an engineering-minded person I appreciated the sentiment, as I enjoy solving problems more than screwing people out of their money for sport.
Exactly. You could easily frame it as "if AWS seems expensive, you're using it wrong". That an internal team could get it so wrong is testament to how difficult it is to get right, but of course, there's a consultant for helping with that.
The smoking gun is probably the box that was previously labelled "Media Conversion Service" (Elemental MediaConvert - easily 5-6 figures/mo. for a small amount of snappy on-demand capacity, or crippled slow-as-molasses reserved queues) now labelled "Media Converter" running on ECS. For example, vt1 instances are <$200/mo. spot and each instance packs enough transcode to power a small galaxy, for fine-grained tuning an equivalent CPU-only transcode solution isn't that much more expensive either.
At some point the industry will wake up to the fact the AWS pricing pages are the real API docs, meanwhile dumb shit like this will keep happening over and over again, and AWS absolutely are not to blame for it, any more than e.g. a vendor of cabling is guilty of burning down the house of someone who plugged 10 electric heaters into a chain of double-gang power extension cords
Exactly right. Most cloud victims are people who have faith instead of cost calculations. DHH & co. are the prime example. It seems even Amazon has such people. I guess hiring is much harder nowadays.
That was my reaction too. I know Microservices doesn’t equal cloud, but putting a big monolith on a big server is tangential to AWS interests to say the least!
> but there's a kind of irony it's come from an internal Amazon team
Not at all. My time working with AWS reps, they never pushed a particular way of doing things. Rather, they tried to make what we wanted to do easier. And the caveat was always to test and make decisions on what was important to us. This isn't an anti-AWS article. Rather, it's exactly the type of thing I'd expect from them. Use the right tool for the right job.
>Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis.
Tldr build the right thing.
>"AWS sales and support teams continue to spend much of their time helping customers optimize AWS spend so they can weather this uncertain economy," Brian Olsavsky, Amazon's finance chief, said on a conference call with analysts.[0]
Amazon isn't afraid of this trend, they're embracing it. Better to cannibalise yourself than be disrupted by someone else
Around 2008 the idea of microseconds were looked down on, until they weren’t.
The key is to look down on nothing, become competent with multiple architectures and know which ones not to implement in a use case if the one to use isn’t clear right away
I don't read it like that at all. Both solutions use the Amazon cloud. Only in one solution you distribute a lot of processes, just because it's possible, and easy to code. When they figured out that rampant distribution was costly, they put more thinking in keeping a lot of computation in the same place (so, "monolith", but still in the cloud). No surprise, they found great savings. If they hadn't, they would not have written about it. But they had to put some (most likely major) effort into redesigning the application.
Probably an unpopular take and my experience is almost 10 years old, but I would be surprised to see the Amazon I worked at try to bury something like this. If the product isn't what the customer wants, it isn't what the customer wants - move on and build something the customer wants.
Yes agreed there was some funny business like not selling Chromecast, but the guiding principle was generally to make things customers want...
Do you think the Lambda team want people to use as many of their services as possible even when it's not actually appropriate and there are better architectures and approaches available? I doubt that. They probably understand that Lambda is a good service for some things and not for others, and using it as a part of deploying things to AWS is a great idea but using it where it doesn't fit makes all of AWS look bad (in particular, hard to use and expensive.)
> Do you think the Lambda team want people to use as many of their services as possible even when it's not actually appropriate and there are better architectures and approaches available?
They most definitely want to as that would most likely mean more money (and promotions) is flowing there.
This is simply incorrect. ECS doesn't cost anything other than what you're paying for the EC2 instances that you place your tasks on. Fargate does, but that's not what they're using.
I agree, my intuition would put it to 1% vs. 99% (difficult to quantify of course).
I haven't yet seen a project/product which would need microservice architecture for technical reasons. If you need to scale, you can just scale monoliths (perhaps serving in different roles).
The use case for microservice architecture is IMHO organizational and driven by high-level architecture. I've worked in a big company (20K employees) which was completely redesigning its back-office IT solution which ended up as a mesh of various microservices serving various needs (typically consumed by purpose built frontends), worked on by different teams. There a monolith didn't make sense, because there was no single purpose, no single product.
But if I'm building a product, I will choose monolith every time. Maaaaybe in some very special cases, I will build some auxiliary services serving the monolith, but there needs to be a very good reason to do so.
I built a little microservice on the side of my monolith for PDF creation. It used headless chrome and ghostscript to render html to a nice PDF. The problem I had with having that code inside the monolith was that it increased my docker image creation for deploys by a lot. And that code pretty much never changed anyway.
I did feel a bit embarrassed having to make a microservice after having argued against them so much over the years. Hopefully I can stop producing PDFs soon so I can delete the entire thing :P
That honestly seems like a reasonable use case for (what I call) auxiliary services serving the monolith. As I imagine, there's no real business logic, no data storage / transactions, no authentication/authorization (besides the service being hidden in the private network probably).
That's exactly what my team did at a former company. We generated reports through a legacy document engine because some customers cannot switch/update their report templates and so we moved the logic out of the monolith into a service to get rid of a large portion of our dependencies.
Moved the monolith to .NET Core, kept the report service on .NET Framework. A win for everybody.
I feel like it's a semantics thing. To me the meaning of microservices has undergone semantic drift to the antipattern in the article where every little component or database table is its own service with an associated "pizza team".
It's fine to just have plain "services" to do things like this where you need to leverage another OS/framework/whatever and just hive off something like PDF conversion while your core application remains a monolith.
This is absolutely the way to go in my experience. Keep related functionality together, that'll probably result in a big monolith, with maybe a few smaller services orbiting it with very specific roles, or dramatically different traffic patterns. One project I worked on consisted of two monoliths, because we were at the intersection of two business domains, and it didn't make sense to attempt to slap those radically different concepts into one model.
You can produce multiple docker images from the same codebase quite easily. You can deploy and scale them separately. None of that requires separate repositories or expensive RPC instead of local function calls.
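As a rough illustration of the "one codebase, several deployables" idea: a minimal sketch, assuming each image simply starts the same package with a different role argument. The role names, `render_pdf` and `handle_web_request` are all made up for the example, and the PDF step is a stand-in string, not real rendering.

```python
import argparse


def render_pdf(html: str) -> str:
    """Stand-in for the heavy PDF step (hypothetical; no real rendering here)."""
    return f"PDF({html})"


def handle_web_request(path: str) -> str:
    # Inside one process this is a plain function call, no RPC involved.
    return render_pdf(f"<h1>{path}</h1>")


# Each role could be the entry point of a separately built and scaled image.
ROLES = {
    "web": lambda: handle_web_request("/report"),
    "pdf-worker": lambda: render_pdf("<p>queued job</p>"),
}


def main(argv: list[str]) -> str:
    """Pick the role from the command line and run it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--role", choices=sorted(ROLES), default="web")
    args = parser.parse_args(argv)
    return ROLES[args.role]()
```

Something like `main(["--role", "pdf-worker"])` would be what the worker image runs. Whether the two roles then talk over the network is a deployment decision rather than a repository one.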
What's the difference if they are in one or two repo if they produce two artifacts that are separated? You *will* have network calls between the two, unless you are marrying yourself with a deployment/operational platform that can run the two artifacts together. (ok, there could be a few but I really don't see how this is just using a "monorepo" instead of a "multirepo")
The problem is when you have more than one such service. Now when one of them changes, all of them need to be rebuilt. You can solve this with multi-stage builds, but those only work if your build result can be easily copied.
Is the issue here that images shouldn't be thought of as layers, but rather a tree of cached directory nodes? I don't quite follow what's meant by building here, are you referring to compiling or merging the resulting build artifacts into a final container image?
Non-copyable build outputs sound a bit wild - you're thinking of builds that encode absolute paths into the output binaries?
There's also developer push from two sides; developers want to do microservices because it gives them gratification (new problems to solve! new architectures! Rewrite!), and employers want to attract developers (we do microservices! Blockchain! IoT!). So much in software development these days is hype and self-gratification.
"Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure."
-- Melvin E. Conway
Came here to say this - and it applies in the other direction. Microservices allow you to split work between teams without having to coordinate deployment and iteration cadence quite so tightly.
If you have a single team, you shouldn't be doing microservices.
I mostly see organizations with multiple internal dev teams who all have shared responsibility over all of the microservices (e.g. no-one is responsible for any service). The worst of both worlds: all of the complexity of microservices architecture, without the benefit of specialization and splitting work between teams.
The organizational structure doesn't have to be reflected in the built artifact(s), though. Just look at Chrome. It has who knows how many teams working on it, but gets built into a single giant DLL (or executable, depending on your platform). And newer languages like Go and Rust make it easier to link everything into one big artifact like this.
It depends on how you slice services. For one micro-service per team I see benefits. On the other end of the spectrum is a single team managing 10-20 micro-services, with more than one service per developer. IMHO that creates more problems than it solves. It is also usually a waste of HW resources, because a library call is cheaper than a network request.
A single team managing 10 microservices that actually make sense to be microservices (like the PDF renderer example above [1]) is kinda good and perfectly manageable.
A team with one single microservice that would actually work better if it was part of a monolith is already in the "creates more problems than it solves" territory.
I would frame it as a necessity rather than a benefit. Having siloed teams (services) is usually a problem which is better to avoid as much as you can.
Having autonomous teams is great for scaling and allowing everyone to go fast, without teams constantly blocking each other.
Having hundreds of engineers work in a single monolith in a single repo without any kind of (enforced) boundaries is a one way ticket to a big ball of mud. You need to invest heavily in tooling to make it work, and e.g. Google does so.
Having a network in between teams is a relatively easy way to enforce boundaries.
It allows everyone to go fast as long as the work is constrained within one service. It goes very slow once service / team coordination needs to happen and one team alone is not able to deliver the feature. This then often leads to services duplicating logic, amassing responsibilities in order to do as much as possible within "my" service to avoid this coordination bottleneck.
I'm developing an ML based sideproject. All the modern ML tools are written in python, which is a really good language for it. However, it is an abysmal language for writing business logic and third party integrations, and if I have some free time I will split the whole thing into one Python and one Typescript service.
The problem is that we are now in the golden age of Web services, where everything is headless and controlled via API, so the business logic plugging all those APIs need to live somewhere.
Naturally it could be a single container taking care of all those integrations.
Microservices work better if you don't trust other teams. While having trust seems like a basic thing, this is absolutely not the case for a lot of companies.
With microservices, it is easy to see services which are down or have high error rate or latency, have clear API contract and call out the team for breaking API contract, and assign cost for which the teams have incentive to reduce, or at least not increase it.
Another pain with monoliths is that they can only be deployed if the entire monolith is passing all tests. When you cannot deploy your changes because someone else on an entirely orthogonal team broke something in the monolith which is not related to you it gets old really quick.
Large monolithic repos with many independent targets for testing and deployment work the best at huge scales. If you are only a few hundred engineers, monorepo with monolithic deployments and tests work fine.
I'd ask why people are merging things that break the tests? I worked on a monolith with hundreds of devs and I can count the time the tests failed because someone force-merged something in an emergency on one hand. It was generally unacceptable to merge something when tests failed; you had to have a really good reason.
Yes, this seems weird; merging breaking code is not an option. The 'breaking team' will have to wait/fix on their side, not us waiting on them for deployment of our working and tested features.
There are some extreme circumstances where pushing broken tests to production makes sense. For example, if you push a simple change to simply `return false` and disable a feature in code. In this case, the tests using it will probably fail but the desired behavior happens in production. At this point, you have a bit more time to set the tests to 'skip' while the load is shed in production. Even if you break tests on purpose, you should fix them asap as you are blocking literally every other team in the company. Thus, you need a really good reason to do so (like if you didn't do it asap, global downtime would ensue).
When committing, do a ff-only of ‘main’ to your branch. Yes, this forces everyone to rebase before “merging” but in practice, this results in the least amount of failures, tests being run after you resolved any conflicts, etc.
If you can use GitHub merge queues, that solves a ton of this, and you can run tests on the final merge before actually merging instead of relying on rebasing.
"Don’t merge. Rebase only. Keep a linear history."
This. It makes life so much simpler. With teams that don't have a lot of experience with git, however, I tend to use the "Squash and Merge" feature, coupled with forcing a linear history.
Requiring a passing integration branch before merging to master. Merging the integration branch then becomes a fast-forward merge.
Alternatively, if you have a low enough merge volume, requiring mergers (by policy) to squash and rebase (and re-run tests before attempting to merge) can work too, as others have already mentioned.
1. Add a test with a time-bomb (such as a test certificate with 365-day duration), wait a year, and now your test fails without having changed.
2. Add a test with a network dependency, and when that dependency is slow / down / turned off, the test starts failing.
3. Add a dependency on a third-party Github repo that clones from `main`, and the next time some dev touches a file in that repo your test starts failing.
4. Add a test that allocates memory in proportion to size of the codebase (e.g. because it tries to build a giant in-memory tarball of all the .mp4 assets). Eventually it will get flaky when it starts scraping up against the build machine's limit. Extra fun if your builds run without defined memory limits on machines of different sizes.
In a monolithic build, there's all sorts of ways for a single person to cause other teams' tests to fail, even months or years after they've left the company. Some of them can be prevented mechanically (such as by running tests without network access), but a lot come down to "tell them to stop doing that".
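To make failure mode (1) concrete, here is a minimal sketch of a time-bomb test next to a variant that pins the clock. The certificate dates and function names are invented for the example. The only point is that the first check depends on the wall clock and the second does not.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical test fixture: a certificate issued once with a 365-day lifetime.
CERT_ISSUED = datetime(2023, 1, 1, tzinfo=timezone.utc)
CERT_LIFETIME = timedelta(days=365)


def cert_is_valid(now: datetime) -> bool:
    """True while `now` falls inside the certificate's validity window."""
    return CERT_ISSUED <= now < CERT_ISSUED + CERT_LIFETIME


def check_with_wall_clock() -> bool:
    # Time bomb: green for a year, then red without any code change.
    return cert_is_valid(datetime.now(timezone.utc))


def check_with_fixed_clock() -> bool:
    # Same logic, but the test supplies the time, so the result never drifts.
    return cert_is_valid(CERT_ISSUED + timedelta(days=30))
```

Passing the clock in is one of the mechanical mitigations. It does nothing for the network and third-party cases, which still need policy.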
That's why big companies never run one build per repo.
Hmm. We never had those specific issues. For (1), we had time bombs for sure, but those usually highlighted coverage issues. Dev ops would disable your test and tell you to fix your code.
IRT (2), network dependencies were forbidden in-general. Over any long enough timespan, the rate of failure is 100%. If you wanted to use the network, you had to consider the failure case and handle it in your tests.
For (3), all dependencies were committed as part of the repo. All dependencies had to be reviewed for any issues before being used, so this made sense. You simply weren’t allowed to just randomly include a new dependency without a review/PR to add it.
For (4), our dev environments had less memory than build machines and the same as production. If you couldn’t build it in a dev environment, it wasn’t getting committed without special treatment from dev ops (and a really good reason).
Yeah, all of those mitigations don't work as well when you've got thousands of engineers whose work would be blocked if some intern's badly written test blows up.
A monolithic build means that your ability to develop and deploy your team's code is dependent on every other team. As the number of teams gets larger, that multiplier really hurts.
It's a learning experience. If everyone learns from it, it probably (most likely) won't happen again. Everyone learns how to write better tests. And, like I said, if you absolutely need to merge something right this exact second and it can't wait until someone disables the failing test (or you can't do it yourself for some reason), you can always merge it even with failing tests.
Average tenure at most companies is 2/3 years. You cannot rely on people learning from mistakes because there are always new people, you need to make it so they cannot make mistakes or if they make a mistake it isn’t going to block the whole company from executing.
Organizations learn, it gets embedded in the culture, tooling, and automation. I shared some of the rules that were embedded in our organization via style guides and onboarding.
There were a thousand automated checks to prevent you from doing the same thing as someone else that caused downtime in the past. It was virtually impossible to commit code that deleted/truncated a table, for example.
>Another pain with monoliths is that they can only be deployed if the entire monolith is passing all tests. When you cannot deploy your changes because someone else on an entirely orthogonal team broke something in the monolith which is not related to you it gets old really quick.
You can just... not allow code not passing test into master branch. They can fuck around in their own one, that's what branches are for
1. Monolithic app and VCS mono-repo are orthogonal and can be mixed in different combinations (with different tradeoffs).
2. How about an old way of blocking deployments less: parts of this monolithic app are developed as libraries with a stable API, and a new version of a library is released only after its own tests have passed. Next you can increment the dependency version in the monolith and run integration tests; if they are failing you can revert to the old library version and still go ahead with the deployment (if you don't depend on something added in the latest library release). If you depend on this new feature and the component providing it is broken, micro-services would not help you.
> If you are only a few hundred engineers, monorepo with monolithic deployments and tests work fine.
And here lies a very important problem IMHO - many (if not majority) of organizations (which do at least some software development) have less than 100 software developers but the industry best practices (which include micro-service architecture) are defined by FAANG-sized organizations and at least some of these practices are sub-optimal for small shops.
It's not only tests, a rollback because your pdf generation has a bug will also mean rolling back for example a customer facing new API, slowing down the API team until the monolith is fixed or the change reverted and rebuilt.
> Two green changes merged to a green main can produce a red main.
I'm sure this happens occasionally, but I've never experienced it, and it seems to be rare enough that it's not that big of a concern. Especially since it'll be easily remedied by either just fixing the error or just reverting one or both of the changes.
It's definitely an issue if you're treating main as 'good to deploy'. Also with more devs the chance of this happening is pretty good. Definitely want to enforce a branch is up to date before allowing merge (though with a lot of devs this can become difficult) not sure if there is some sort of merge/test pipeline solution out there.
It completely depends on your organization's merge volume and your codebase's complexity. I've seen it happen many times working on a large monolith. But requiring a passing integration branch build before merging to master, or a merge queue, solves that.
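For anyone who hasn't run into this, a toy model of how two green branches can produce a red main. The "codebase" here is just a dict of functions and all names are invented. Change A switches a price from cents to dollars and fixes the one caller it knows about, while change B adds a new caller that still assumes cents. Neither touches the other's lines, so the merge is textually clean.

```python
def run_tests(code: dict) -> bool:
    """Run every check that applies to this snapshot of the codebase."""
    ok = code["format_price"](code) == "$12.50"
    if "total_with_tax" in code:
        ok = ok and code["total_with_tax"](code) == 1375  # expected in cents
    return ok


base = {
    "get_price": lambda code: 1250,  # cents
    "format_price": lambda code: f"${code['get_price'](code) / 100:.2f}",
}

# Change A: get_price now returns dollars; its only known caller is updated.
change_a = {
    "get_price": lambda code: 12.50,
    "format_price": lambda code: f"${code['get_price'](code):.2f}",
}

# Change B: a new caller written against the old "cents" behaviour.
change_b = {
    "total_with_tax": lambda code: round(code["get_price"](code) * 1.10),
}

branch_a = {**base, **change_a}            # green on its own
branch_b = {**base, **change_b}            # green on its own
merged = {**base, **change_a, **change_b}  # no textual conflict, tests go red
```

Running `run_tests` on each branch passes, and on `merged` it fails. A merge queue or integration branch catches exactly this by testing the combined result before it lands.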
IMO for these rare occasions it should be okay to unmerge PRs.
Or what could also be done is you only deploy cherry-picked releases, granted that you have a way of tracking commits in the upstream such that no commit ever gets lost.
That is possible, but at that point you need an integrations/release team to babysit the builds and make sure everything integrates back together cleanly.
The last thing I want to do is have my build broken by someone on another team and then have to track them down and babysit the revert. That is easily an hour of my time wasted.
I think the trust issue doesn't really justify microservices. Assuming everyone is using interoperable languages you can still have a monolith with clear API contracts and separate ownership by using traditional libraries.
That is a good point about reliability and cost though. I hadn't heard that before.
What if some team makes their part 10 times slower? This is not a theoretical scenario, but one I saw happening many times. Technically you could partition and monitor each part of the monolith separately, but then you are just reinventing microservice architecture.
You just run a profiler. It's better and arguably easier than microservice profiling since it doesn't only measure at the interface, can measure fine grained memory usage, etc.
Google "continuous profiling".
I'm not sure why you would think that that reinvents microservices.
Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.
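A small sketch of what "just run a profiler" can look like inside one process, using Python's standard `cProfile` and `pstats`. The two "team" functions are invented. This is a one-off profile, not continuous profiling, which needs dedicated tooling.

```python
import cProfile
import io
import pstats


def team_a_parse(n: int) -> int:
    """Cheap component (hypothetical owner: team A)."""
    return sum(range(n))


def team_b_render(n: int) -> int:
    """Deliberately wasteful component (hypothetical owner: team B)."""
    total = 0
    for _ in range(n):
        total += sum(range(50))
    return total


def handle_request() -> int:
    # Both components run in the same process, as in a monolith.
    return team_a_parse(1000) + team_b_render(2000)


def profile_report() -> str:
    """Profile one request and return per-function stats as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
    return out.getvalue()
```

The report lists `team_a_parse` and `team_b_render` separately with call counts and cumulative time, so per-component cost is visible without a process boundary. It does not settle who gets paged about it, which is the low-trust objection.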
In my experience, profiling is hard and a lot of times doesn't show the issue. e.g. for unnamed goroutines it is hard to see which goroutine profiler is referring to. Or say if some code change increased CPU usage without increase in time/memory, it will affect the entire monolith performance. Yes a good maintainer could pinpoint the issue, but remember my premise was that it is low trust environment, and saying I think your code change increased CPU usage involves talking to managers and being in 2 meetings. And in microservice they would have to deal with their alerts to not miss the SLA.
> Microservices is just taking a monolith and moving the components into separate processes that communicate via RPC.
Microservice architecture divides the responsibility much more than that. They have separate redis cache, local cache, tests, and even likely has different DB etc.
I haven't worked on a microservice architecture yet, but this is a very interesting idea that I hadn't heard before. That micro services can potentially give greater visibility of each team's performance, improving accountability.
It's way too easy to just add a lot of friction if you go too far.
I also think it would be way healthier if teams acted as "maintainers" rather than "sole developer" of a service.
For example if team A wants a feature from a service team B manages, they should be free (after communicating that, so there is no conflict/work duplication) to just make that feature and submit a pull request to team B.
Then team B can make sure it's up to their standard but that's shorter work than getting the whole machine of "submit a ticket for team B to add feature, find manpower to do it, and schedule work" running.
It's incredibly easy to game that architecture in a low-trust environment, though. If a team owns the interface definition of their microservice, they can just declare all callers' problems an instance of "holding it wrong".
That's my experience too. Microservices are first and foremost a technology to dilute responsibility, and if you're clever about it, you can even let it fall through the cracks completely.
I absolutely agree, buuuut also realize we as programmers don't even have the same definition of what a microservice is.
A lot of people here say...one service per team. But to me that is, or can be, a monolith. Often a team is a product line, so you have one service for that product. Is that a monolith? I don't know either, I guess.
I -do- know most people who go around promoting that sweet microservice life end up being the worst. They seem to want every db table to be its own service, introduce tons of message passing and queues, etc, for absolutely no reason. I think we can probably all agree that is about the worst way to go about it.
Devils advocate, but is it possible microservices need a shared feature-ful message passing layer, and good tooling, to be working well? Eg schema, auth, flow control, ttl, persistence, partitioning etc in the message layer, a la Kafka? I mean it’s kind of implied that microservices can only work if they can talk to each other, and “talking” is a lot more nuanced than we tend to think.
Comm requires massive overhead versus simply 'calling a function'. Calling a function doesn't fail. Maybe the code in it does, but not the call to the function itself.
This is why microservice costs often far outweigh the benefits, but they rarely consider the cost in their crusade to 'break up the monolith'
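A toy comparison of the two call styles, to show where the extra code comes from. The remote call is simulated in-process with an injected failure rate. `Unavailable`, `make_remote_tax` and the retry count are all assumptions made for the sketch, not any real RPC library.

```python
import random


def local_tax(amount_cents: int) -> int:
    """In-process call: the call itself cannot time out or get lost."""
    return amount_cents // 10


class Unavailable(Exception):
    """Simulated transport failure (timeout, connection reset, ...)."""


def make_remote_tax(failure_rate: float, rng: random.Random):
    """Build a fake remote version of local_tax that sometimes fails."""
    def remote_tax(amount_cents: int) -> int:
        if rng.random() < failure_rate:
            raise Unavailable("simulated timeout")
        return local_tax(amount_cents)
    return remote_tax


def call_with_retries(fn, arg: int, attempts: int = 3) -> int:
    """Every remote caller needs something like this; local callers do not."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn(arg)
        except Unavailable as err:
            last_error = err
    raise last_error
```

With a failure rate of 0.0 the wrapper is pure overhead, and with 1.0 it still fails after three attempts. A real setup also needs timeouts, backoff and idempotency, all of which the local call gets for free.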
Absolutely. But in the cases where a monolith is not doable, such as highly heterogeneous hardware requirements for different “services”, it seems like the messaging stack plays a more important role (even if that’s only needed in 5% of use cases). Typical HTTP request-response, which is often fine for a monolith, is not enough for building say task queues. A strong messaging layer can reduce the need for ad-hoc wheel reinvention.
In my org in Google we average over one microservice per engineer. I'll be adding two in the next couple weeks. With the right automation setup you don't notice them any more than you do server instances.
The CFO read an article in Forbes that said we would save money by migrating to the cloud, so we now have an unlimited budget for consultants to build us a cloud platform....
Your job is to figure out what they actually need even if they don't understand it?
Seems pretty par for the course.
There are dozens of reasons to migrate to the cloud. Do they apply to everyone? No. Are they always worth the cost? No. But the whole "cloud vs not cloud" argument that happened, got settled ("cloud"), and is now being restarted by the DHH-like is not data-driven and full of exaggerations and fear-mongering from both sides.
Then you add on top of that that the main product of moving to the cloud is "operations" which is typically measured in "hours of human capital being impacted outside of core working hours". When the market is booming, tech humans are expensive and fickle, and don't want to undertake more operations than they should have to, and companies are forced to pay cloud providers.
But in today's 2023 climate, any company looking around to decide how much to spend on cloud just says "Why would we pay for something when we can just ask our engineers to work more hours, and invite them to quit if they don't like it, oh wait nobody else is hiring anyway"
No cost calculator of $$$ saved considers that overtime is free in our industry.
tl;dr the cloud backlash is overblown, more companies/businesses would benefit from cloud than not.
This right here is one of the reasons I got out of software development. Not micro services in particular, but just the unthinking application of some new pattern to everything.
Everyone wants to do the new cool thing. Everyone wants it on their CV. To be followed some years later by everyone saying how awful it is, and moving on to the next fad. Rinse, repeat, round and round we go with no actual intelligence being applied.
Microservices have lately seemed to me to be a buzzword for the ears of executives and stakeholders. To someone who isn't technical enough, it seems really "cool" from the outside, but on the inside, it's more than often a shitshow with teams and managers messing around to get these services working with each other properly while wasting a lot of time.
If you ask me, if the time and focus is invested properly, it would be much more efficient to run a monolith instead. That's what some small number of great teams end up doing.
Not only that, if you spoke out against microservices you were labeled not a ‘team player’ outcasted to maintain ‘old’ code that runs the entire company as a bunch of hot new devs create a mountain of crap services only to quit when they got bored.
Yeah that seems closer to my experience. From my perspective, 2016ish was peak. At least that's when I had to argue the most against trying to needlessly break up services.
There's a bunch of things I'd like to have them do. If they could span across machines like clusters that would be amazing.
If I could trivially package them up and deploy them locally with intrinsically less effort and wall-time than the old way, that'd be amazing.
If I could somehow get the horizontal scaling promises and redundancy as some kind of built-in, like I can with say, memcache, that'd be cool.
If I could do these kinds of "hard" things with them more trivially, that'd be really nice.
There's a lot of things I want them to do but it's a god-damn bull-riding rodeo every time I try to get there.
And before you reply, I know you're an expert and can do all these things trivially. That's amazing. The vast majority of the industry creates a giant fragile spaghetti knot with them and I am not a full time k8s admin nor do I want this to be a career trajectory. It should be like you know, wine, ffmpeg, imagemagick, virtualbox, lua, qemu, redis, gnuplot, lvm2, gdb, ssh, sqlite; tools like that. It's pretty easy to get them to do really nice things. Those things deliver on their promises and potential pretty nicely.
It's nice that nobody feels a need to hype curl or squid. They just work. Isn't that nice? I mean look at gdb's website: https://www.sourceware.org/gdb/ it doesn't even have CSS animations --- in fact, it doesn't even have CSS.
Absolutely. It's pretty brilliant. I thought about mentioning it. I don't hate it. It's a bit exotic to trust with the likes who tend to fill the ranks of development teams which try to force every language into looking like C++ or Java but I dunno, send them off to a retreat in the mountains to microdose and take lessons from an erlang yogi for 12 weeks and have them come back.
Your asks seem easily answered with docker + kubernetes. Actually, this is in fact the use case for kubernetes — a fault tolerant distributed system running arbitrarily, simply packaged code. This has to be what you’ve tried — what issue are you running into?
Kubernetes isn't something I'd put in the same sentence as 'easy'. Docker is a close contender for the same.
I still recall the day when my local Docker builds necessitated a new router to properly manage streaming traffic at home while I downloaded a few GBs of layer images. Or the time I wanted to setup a 'simple' hosted Kubernetes cluster of my own in my lab for testing, only to discover the nightmare that is networking on it. Then there was the grim discovery that Docker containers were much more sensitive to the hosting environment than I had assumed, resulting in some fun "but it worked and tested fine" moments.
Did they all work eventually for me? Yes. Was it simple? Not by my standards.
When they say k8s is "easy" they mean "for developers", not people/automation running it.
Like, compared to implementing hitless rollback over bare metal services, the k8s way is "easy": just set some stuff in YAML and have proper healthchecks in your app.
Trying to use the actual software to accomplish these actual tasks. You're right though, that is the promise of the software - it doesn't deliver.
I wish I had infinite time to document all the issues. This isn't a small nuanced detailed thing - it falls deeply, systemically, fundamentally short and in practice you still get the magical monolithic system it tried to kill but now with more obfuscation, complexity and a theatrical sleight of hand to convince yourself it isn't that.
Instead of the server being configured for the monolithic app, it's now extensively and carefully configured for the myriad of containers, hostnames, configurations and connections of the containers running the microservice app.
It's in practice the same problem with a different costume.
The other promise of it being a collection of smaller constrained services running on tcp ports talking to each other ... that's nothing new. You've invented the idea of computer networks.
The issue I encounter is the overhead in setting up a repeatable, easy-to-use dev environment, and working out the bugs locally before I push to a prod-like system.
Like any tool, there is nothing that cannot be done without microservices. However that doesn't mean they never make sense. Microservices have certain costs and certain benefits. I can believe there are certain situations where the benefits outweigh the costs. It's just not most situations. But that doesn't mean it never exists. I could believe it makes sense in extremely large apps with a huge number of different groups working on them, where the communication complexity outweighs the other complexities microservices bring.
That's very hypothetical. Building a distributed system just might always be way more resource intensive than a comparable monolith regardless of the number of developers involved.
Remove reliability intercorrelations, so for example that your cart api and payment gateway are up and collecting orders no matter what happens to the front-end services.
But then for perfect decorrelation you'd also need independent databases behind the microservices, and queues between them for horizontal communication, and few are actually going all in with that, and so fall in the 95% where they go through the motions and the effort of splitting microservices, but reap no actual benefit from it.
The problem I see in many projects is that they start out as - or implementing - a microservice architecture. I think this is backwards; you should start with a monolith and separate out concerns into microservices if it makes sense, not because it's "cool."
I agree, but aside from it being seen as "cool", what drives some engineers to go microservice first architecture is having experienced the inability of an organization to acknowledge that they actually do need to re-write a monolith as two or more services or undergo a general re-architecting of the monolith. Getting buy-in from the business is extremely difficult as clearly communicating the actual effort and impact that re-architecting the monolith would require is nearly impossible. This is usually due to poor separation of domains via lack of modules within the monolith, spaghetti code, circular or other strange dependency trees, tables with relationships or data that should never have existed in those tables, and a whole other set of other bizarre issues that were due to lack of planning and general discipline by engineers along the way.
If you have a microservice first architecture, the perception is, it's easier to describe effort to re-write an individual service or split it into two services as there is a clearly delineated body of work. Bizarre service-to-service dependencies may still exist and a poorly implemented microservice architecture is still a potential challenge.
Point being, organizations incentivize bad economic decisions on the part of engineers through the inability to recognize that rework is a necessary aspect of developing software and by constantly eschewing rework in favor of feature delivery it sends a strong message to the engineer about what to prioritize.
Yeah, but writing a big chunk of new code always involves either gambling or cargo culting, until you nail the actual requirements and the design. MSA is just a methodology to contain risks from the uncertainty, and it never says you must build everything in MSA. It's often better to migrate mature code into (semi-)monolithic services.
Microservices make sense for a lot more than 5%. In fact I think it is much closer to 80/20: 80% working on serverless, 20% not. Video streaming is obviously not going to work on AWS Lambda to begin with.
I finally realised after using Lambda for almost a decade (started to use it when it was released 9 years ago) that instead of thinking about apps that you map to lambda functions you should think about features instead.
A simple example: I have a SPA that has the following features: auth(login, logout), dashboard, feature a, feature b. I can write a few very simple lambda functions and deploy these the same way (IaC). What do we (my team) win? We can implement each function in a language we want. You have a feature that is too slow? Rewrite it in Rust. You have an amazing Python lib for feature a? Use Python. What else? We almost never touch auth, so if a feature has a bug it does not impact the entire application. Security is better because we can allow individual functions to access part of the infra they really need to access. Lambda functions can call other lambda functions as well.
Downside is that we cannot use a shared cache that is easy with a monolith. People need to design the boxes well which functionality goes to which lambda function. We have to use distributed trace ids to track requests.
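To make the trace id point concrete, here is a minimal sketch of what passing one id through per-feature functions might look like. Everything in it is hypothetical: `auth_fn`, `dashboard_fn`, the `trace_id` event key and the in-memory `LOG` list are stand-ins, not anything from Lambda itself or from our actual setup.

```python
import uuid

LOG = []  # stand-in for a central log store (assumption, not a real service)


def with_trace(event: dict) -> dict:
    """Reuse the caller's trace id, or mint a new one at the edge."""
    return {**event, "trace_id": event.get("trace_id") or uuid.uuid4().hex}


def auth_fn(event: dict) -> dict:
    # Hypothetical "auth" feature function.
    event = with_trace(event)
    LOG.append((event["trace_id"], "auth", "token ok"))
    return event


def dashboard_fn(event: dict) -> str:
    # Hypothetical "dashboard" feature function that calls auth.
    event = with_trace(event)
    LOG.append((event["trace_id"], "dashboard", "start"))
    auth_fn(event)  # function-to-function call passes the same id along
    LOG.append((event["trace_id"], "dashboard", "done"))
    return event["trace_id"]


tid = dashboard_fn({"user": "alice"})
for entry in LOG:
    if entry[0] == tid:
        print(entry)
```

Filtering the log by that one id gives you back the whole request across functions, which is the thing you get for free inside a monolith.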
I kinda thought about making "monolithic lambda", where there is just interface to get the request, respond to request, logging, and maybe some queue to talk with other components, for some of my personal tools.
Basically cut down the cruft when deploying another small self-contained feature but still keep the code running (savings of a few MB of memory are meaningless if you just have a few dozen features that might run at the same time anyway).
Then I realized it's basically reinventing the ancient idea of "application server" like JBoss and EJB... which is kinda the case for lambda anyway.
The problem is more like that many people don't understand the tradeoffs and when to use microservices. This becomes even more obvious when you ask them what their current architecture is and what problems they hope to solve for that needs a transition to another architecture.
The reason for doing microservices I've been given by a two person developer team that had created a 15+ microservice single-server k8s monster was: 'this is how it is done today' Yeah, IT is like the fashion industry.
Yes but can we also consider “3p APIs that should have been a library” as microservices? It feels like that model has sneaked in as common practice but it suffers the same (and more) problems as multiple (1p) microservices.
My team owns an API monolith that hosts several completely unrelated endpoints. I keep thinking this would be a good candidate for breaking into microservices, but I do wonder if I'm buying into the hype.
Massive architectural refactors are so attractive (at least to the kind of mind that likes Factorio) and so expensive. At least make sure you've got some concrete benefits that you think might arise from breaking apart the monolith, so you can do some semblance of a cost-benefit analysis!
When you say unrelated, are you sure? Do they share -- or should they share -- a common underlying relational data model?
My biggest grief with microservices is the fact that it's effectively become a war on having a coherent logical normalized relational data model inside an organization.
This is a good question to think about. Right now, they don't have any relationship beyond serving different parts of my group, but perhaps they could be redesigned in a more cohesive way.
The core pessimal problem presented by microservices is this: someday some stakeholder is going to ask for some information to be joined together or interlinked -- and that information will have been unwisely put into separate services ... and now you'll be doing joins manually via web service calls -- over the network -- and somewhere the ghost of E.F. Codd is spinning in circles and cursing you.
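A toy illustration of that, using an in-memory SQLite database. The `customers`/`orders` tables and the two "service" functions are made up for the example, and each function call merely stands in for a network round trip:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Edgar');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 5.0);
""")

# Shared relational model: the database does the join in a single query.
joined = db.execute("""
    SELECT o.id, c.name, o.total
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

# Split services: every function below pretends to be a remote call.
call_log = []


def order_service_list():
    call_log.append("orders.list")
    return db.execute(
        "SELECT id, customer_id, total FROM orders ORDER BY id"
    ).fetchall()


def customer_service_get(customer_id):
    call_log.append(f"customers.get({customer_id})")
    row = db.execute(
        "SELECT name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    return row[0]


# The "join" is now a loop in application code, one lookup per order row.
manual = [
    (order_id, customer_service_get(customer_id), total)
    for order_id, customer_id, total in order_service_list()
]

print(manual == joined, len(call_log))
```

Same rows either way, but the manual version needs one call per order plus one for the list, and in real life each of those is a network hop with its own failure modes.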
I think it's actually quite rare for companies to have data so actually autonomous and unrelated that it does not logically relate to anything else in the organization.
However, I think there is something hiding inside the µservices movement that is actually much more generally applicable and useful: API-first development.
I would argue that in such cases those are not "micro" services anymore, they are services. In that case it makes sense to develop and deploy them separately, then find a way to make them talk to each other. Microservices is a different architectural decision.
My opinion is that there is a point where that is true... but it's at a really high scale. Each team owning a separate service introduces a lot of complexity in managing all the services. There is a point where communication complexity of not microservices overwhelms the complexity implicit in microservices, but I think it is at a really high scale.
You also have to consider things like it is now harder for people to see the system as a holistic whole (the tricky bugs are often in the composition of components) and a lot of subtle effects that brings. Even just increasing the friction for people to move between teams or friction for security people to apply consistent standards across all groups.
And then team 1 needs to upgrade pandas to 2.0, but team 2 is still on pandas 1, so when the main app pulls them in nothing works, so you need to start a cross-team committee to schedule the work to upgrade a single library...
Separate services aren't a silver bullet, but as an fyi to the younger software developers, we tried "just have all teams work on the same code base and deployable artifact" for a long while and it didn't work very well either.
All the teams will need to migrate sooner or later so figuring out all of the potential problems in migrating and having everyone do it at once is more efficient than each team needing to figure it out separately.
That's not how it plays out in reality. Usually nothing gets done because "upgrade this package" is never on anyone's priority list. Or teams end up doing shit like JAR shading or forking and renaming a package with some _v2 or whatever suffix to be able to support both the old and new version simultaneously in the main code base. And then of course nobody ever updates the runtime (hello, enterprise monoliths still running on Java 6/7!) It ends up being a complete mess.
I'm just grateful a language like Java has any sort of namespace solution to dependency nightmares.
I have lost track of how many times I did a git pull on a Python based solution only to find I broke all the things when I tried to upgrade one package.
Imagine a solo developer, writing an app that is composed of packages/libraries/crates from the get go.
Now in one place such an engineer uses pandas 2, in another place pandas 1, but it is just one single app. What does it say about the quality of engineering and mental focus of such a solo developer who cannot accomplish the same thing with the same API - OR cannot refactor the already written code from pandas 1 to pandas 2?
Sounds to me like more of an engineering discipline and engineering mindfulness problem.
Fix is simple with a simple rule - everyone has to use the latest major version, always.
Micro services do not make any sort of people's communication go away, they move it to different boundaries. From dependencies to the business layer/interfaces which is lot harder to navigate and negotiate.
Imagine needing a field in your downstream service. They refuse because they don't see it their domain and you cram it on your side and what not. Ask anyone working in micro services environment and they'll tell you it is a recurring issue every quarter if not more.
That's easy: we'll make the ultimate build system! It will scale, and maintain packages, and compile all the things transitively. Just give me $xx million dollars and a few years, and I'll give you the perfect solution.
Just press this button to start the upgrade build and....boom! 10,000 services and their dependencies being built on a ton of hardware; we can practically guarantee your change in dependency will be checked... Whoops, turns out your one dependency change cascaded into about 1.5% breakage....no, I don't know who owns those packages; why do you ask? That's not my job!
Yeah I see a lot of things that could be libraries packaged as services, so now each invocation incurs network latency and every transaction needs a two phase commit. And because each service needs its own replica, deployment pipeline, and versioned internal api, production and deployment costs skyrocket.
Because, you see, if you surround shit with other shit, that original shit doesn't look quite so bad in comparison. So take your shit monolith, surround it by shit services that distributed it across a shitty network, and now your original self inflicted shit design is just 1/3rd of the shit you gotta deal with. Totally not as bad as it used to be!
Until one team needs a feature in another service that makes their development grind to a halt and the other team is not prioritizing.
I have only seen this from the business side (I'm not a developer), but I have seen teams start coding in another team's service just to be able to proceed.
It's not always good to create silos like this either.
Sounds like someone has been in the trenches of a certain online retail company.
As a developer, I have certainly seen the same. Pretty sure this very scenario is where I heard the term "away team" used in the industry: send your folks over to change things, and under our guidance they can check in the code.
AWS has a great business model of people over "optimizing" their architecture using new toys from amazon and being charged through the nose for it. It's amazing how clients that are doing a few requests per second will want a fully distributed, serverless, microservice + dynamodb + s3 + athena + etc + etc, in order to serve a semi-static web app and print some reports off throughout the day and pay 10-50k a month when the entire thing could run on a few nodes and even a managed RDS instance for a thousand bucks a month. I would argue at this point that early optimization of architecture is astronomically worse than even your co-worker that keeps turning all of your non-critical, low-volume iterable functions into lanes to utilize SIMD instructions.
Some irony in my anecdotal experiences is that most places that don't have the traffic to justify the cost of these super distributed service architectures also see a performance penalty from introducing network calls and marshaling costs
Yes, and it attracts just the wrong kind of dev/architects.
At a previous shop, we hired a cloud architect to drive our "cloud adoption".
He of course bet the farm on a set of new AWS services that were barely in version v0.9 to be the backbone of the system he architected.
It quickly became clear even he had no experience with the set of tools & services he had advocated, and the whole thing went off the rails slowly & surely.
Lo & behold, 100% of existing customers are still on the on-prem offering 2 years later, and if you throw in the new customers that were shoehorned onto the AWS offering, his team has captured 2% of customer use after 2 years of effort.
> AWS has a great business model of people over "optimizing" their architecture using new toys from amazon and being charged through the nose for it
I was back on AWS for the first time in a few years this week and the amount of new "upsell" prompts in the console is ridiculous. Spin up an RDS instance - "hey, would you like an Elasticache cluster too?". I think AWS are very aware of this behaviour and encourage it. Simplicity is not in their interest.
It's honestly like a cult and a desire to want to "do it right" on AWS. The last few projects I've spent so much time setting up code deploy, load balancers, certificates, SES, route 53... This newest project, I've gone to heroku with everything being basically a few clicks to get setup.
So guys we need Lambdas + Step Functions + SES + SQS + SNS + MSK + AWS Batch + S3 + Lakeformation + Cloudformation + Athena + EMR + Redshift + Aurora + SageMaker + Cloudtrail + Codepipeline + maybe some EC2s to run AWS CLI on them.
Don't forget to configure Route53, VPC, IAM and an ELB.
Great - ready to start writing your app now?
Oh wow one of those components as configured with the other components isn't behaving as expected - time to contact AWS support!
Cynically I think CTOs see all this stuff and think they'll turn all their expensive on-shore devs into cheaper DevOps because AWS is magic and you don't need to write hard app code anymore.
I'd counter that AWS forces expensive on-shore devs into having to wear an entire new hat and be half a DevOps engineer to figure out how to make their code work on this alphabet soup instead of a Linux server.
It seems like another case of the road to hell being paved with good intentions, most places want/need redundancy and some managed devops and so a few ec2 instances and managed RDS is affordable enough and checks a lot of boxes, but after people start down this path it seems almost irresistible to start drilling down into managed kubernetes, spark jobs, to start ingesting some events we'll just introduce glue, and that plugs right into S3, and look how easy it is to plug in athena, add some quick alerting with cloudwatch, and the next thing you know you're vendor locked and having to hire a full time devops person with AWS experience to configure, manage and keep on top of it all.
It is even more amazing when the entire $10k AWS setup can be replaced by a single minimally optimized monolith running on one $20/month Hetzner server that responds several times faster to most requests due to no internal latency.
A $5 instance gets you something like 1 core on a slightly dated CPU. Aka approximately as much processing power as a top of the line desktop processor 15 years ago. Aka enough processing power to fill a 25 Gb/s port with TLS data (not that you have the port to go with that).
A few requests per millisecond should be well within the capabilities of this instance, depending on the complexity of each request of course.
Tons of the pieces you mentioned are probably not that expensive to run for a small use case, given you're only charged on demand. The cost is really in the dev ops time and expertise to orchestrate the whole affair, and in the new ways it can break.
I’d note each of what you mentioned costs $0 at zero scale and nominal $ at small scale. But you’re right, engineers new to aws try to flex all the kit together for not much benefit. For a semi static website all you need is s3+cloudfront+api gateway+lambda+dynamodb for state. This would cost you basically $0 for small scale, and there would be nothing to monitor. It either works or aws is down.
I kind of see the opposite. Relying heavily on stuff like lambda has scaling limitations but it’s fast to get up and running. Built-in interactions between AWS services can do a lot of the lifting for you. And then if you find out that’s not a great fit for what you’re doing you can put in more bespoke pieces.
I actually worked on an Azure based project recently and it was very similar.
It was a small semi static contact form that was deployed on 27 web apps (9 services x 3 environments) and used a NoSQL storage, redis, serverless stuff, etc.
Insanely complex deployment process, crazy complexity and all over the place.
The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs."
And the article itself mentions the 90% cost reduction.
So the title seems pretty much in-line with the original intent.
But, by omission, it reads that Prime Video rebuilt their stack without serverless and got a 90% cost reduction.
This post is going to pick up a lot of traction and I suspect these comments are going to bikeshed monolith vs microservices for the next day.
On reading it, this is for a video quality monitoring system that needs to consume and process video. Generally a compute and time intensive task. Something not always suited to serverless, particularly when it’s not easy to parallelise.
The task at hand doesn’t sound ideally suited to serverless, but the existence of the post shows that’s not readily obvious. So it’s a valuable post to explain a scenario where a few big machines is the best call.
But the sensationalism of the headline, would suggest all serverless is expensive and wasteful. When in reality the same is true for a non-ideal workload on a monolith.
Serverless has such bullshit insidious pricing that makes it seem like you're saving money only to figure out you're in shit once you're knee deep in it.
For example you'll have to read fine print to find out that 256MB lambda will have the compute power of a 90s desktop PC because compute scales with memory. And to get access to "one core" of compute you have to use like 2GB of memory.
Now you may say "serverless isn't geared towards compute" - but this kind of CPU bottlenecking affects rudimentary stuff - like using any framework that does some upfront optimizations will murder your first request/cold start performance - EF Core ORM expression compiler will take seconds to cold start the model/queries ! For comparison I can run ~100 integration tests (with entire context bootstrap for each) against a real database in that time on my desktop machine. It's unbelievably slow - unless you're doing trivial "reparse this JSON and manually concat shit to a DB query" kind of workloads.
You could say those frameworks aren't suited for serverless - or you could say that the pricing is designed to screw over people trying to port these kinds of workloads to serverless.
The problem isn't paying for cold start - the problem is they make the low ram lambdas very very niche by CPU scaling - you can have a 256 mb web server that talks to a database easily - and that's their supposed selling point - but having it served on ~300MHz CPU is really really limiting - and they should be upfront about that.
If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad. You could say "why didn't you ask for specifications" but you have certain expectations of what a car should behave like and what they gave you doesn't really qualify as a car no matter if their disclaimer was technically correct.
Do you need a screenshot and red box around the text or would you believe me if I tell you it is written on their lambda pricing page near the beginning? It's also written in docs about configuring lambda functions so at this point it is a PEBKAC/RTFM issue, not "them not being upfront".
And frankly it is done that way because they have standardized machines; scheduling CPU heavy/memory light and cpu light/memory heavy is extra complexity. I mean, they should, but they have no real incentive to, as in most cases apps written in slower languages are also memory-fatter so it fits well enough.
> If you went to a car rental and they told you we have a cheap car that's slower when you add passengers - and then you drive it to pick up your wife and it turns out it only goes 20 km/h when your wife gets in - you would be rightfully mad.
Getting lowest tier one is more like renting a 125cc bike than a car if anything. You can do plenty with that limit in efficient language too.
>Do you need a screenshot and red box around the text or would you believe me if I tell you it is written on their lambda pricing page near the beginning ? It's also written in docs about configuring lambad functions so at this point it is PEBKAC/RTFM issue, not "them not being upfront"
Simple CPU time calculator on the pricing calculator page when you enter the RAM would be sufficient, linking to the said docs. Trivial to implement, really cleans up things when planning resource costs.
All of things you are complaining about are well known facts that are clearly stated in the documentation.
I don't care about what is the equivalent computing power in 90s desktop measurement because you cannot replace a lambda function with a 90s desktop, so it is pointless.
The right approach is: I have a problem A that I can implement using AWS Lambda, AWS EC2 or your favourite DHH approved stack; how do the costs of these compare to each other?
Can you point me to where this is clearly stated in the documentation ? I only found one reference as a passing note when I went searching for it. This would be a value displayed in the pricing calculator with a link to explanation if they were being honest.
90s CPU comparison is just to demonstrate how out of place it is with what people are used to even on lowest tier hosts with shared CPU cores. Low ram compute seems to be artificially limited to make low ram lambdas useful in very narrow use cases.
For reference I have a devops team in-company that deployed and maintained several AWS projects, including some serverless, even they were surprised at the low compute available at low RAM lambdas.
Memory is the principal lever available to Lambda developers for controlling the performance of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The Lambda console defaults new functions to the smallest setting and many developers also choose 128 MB for their functions.
It is known that at 1,792 MB we get 1 full vCPU (notice the v in front of CPU). A vCPU is “a thread of either an Intel Xeon core or an AMD EPYC core”. This is valid for the compute-optimized instance types, which are the underlying Lambda infrastructure (not a hard commitment by AWS, but a general rule).
If 1,024 MB are allocated to a function, it gets roughly 57% of a vCPU (1,024 / 1,792 ~= 0.57). It is obviously impossible to divide a CPU thread. In the background, AWS is dividing the CPU time. With 1,024 MB, the function will receive 57% of the processing time. The CPU may switch to perform other tasks in the remaining 43% of the time.
The result of this CPU allocation model is: the more memory is allocated to a function, the faster it will accomplish a given task.
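As a rough worked example of the arithmetic quoted above: the sketch below assumes the linear "1,792 MB is about one vCPU" rule of thumb from that passage, which is described there as a general rule and not a hard commitment by AWS. `vcpu_share` is a hypothetical helper, not an AWS API.

```python
# Rule of thumb from the quoted text: 1,792 MB of configured memory ~ 1 full vCPU.
FULL_VCPU_MEMORY_MB = 1792


def vcpu_share(memory_mb: int) -> float:
    """Approximate share of one vCPU's time for a given memory setting."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory must be between 128 MB and 10,240 MB")
    return memory_mb / FULL_VCPU_MEMORY_MB


for mb in (128, 256, 1024, 1792):
    print(f"{mb:>5} MB -> {vcpu_share(mb):.0%} of a vCPU")
```

Under that assumption a 256 MB function gets about a seventh of one hardware thread, which is the kind of number it would help to see next to the memory slider.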
Yes this is a ridiculous clickbait. For once the original title is not and the poster had to make it so... Why is dang not changing it back?
PrimeVideo is very much based on a microservice architecture. Hell, my team which isn't client facing and has a very dedicated purpose has easily more microservices than engineers.
I guess all titles are clickbait to some degree. That said, the OP should have used the original title. Dan G. often corrects this mistake after the fact.
"We built a video stream processor by splitting every 1080p+, multi hour long, 30-60fps video into individual images and copying them across networks multiple times."
Not surprising that didn't go well. This strikes me as a punching bag example.
Anyone who has worked with images, video, 3d models, or even just really large blocks of text or numbers before (any kind of actually "big data") knows how much work goes into NOT copying the frames/files around unnecessarily, even in memory. Copying them across network is just a completely naive first pass at implementing something like this.
Video processing is very definitely a job you want to bring the functions to the data for. That is why graphics card APIs are built the way they are. You don't see OpenGL offering a ton of functions to copy the framebuffers into ram so you can work on them there only to copy them back to the video card. And if you did do that, you will quickly find out that you can be 10x to 100x more efficient by just learning compute shaders or OpenCL.
You could do this in a distributed fashion though, but it would have to look more like Hadoop jobs. I predict the final answer here, if they want to be reasonably fast as well, is going to be sending the videos to G4 instances and switching the detectors over to a shader language.
In general, if the data is much bigger than the code in bytes, move the code, not the data.
IO is almost always the most expensive part of any data processing job. If you're going to do highly scalable data processing, you need to be measuring how much time you spend on IO versus actually running your processing job, per record. That will make it dead obvious where you should spend your optimization efforts.
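A minimal sketch of that per-record measurement, assuming a job that can be split into a load step and a process step. `fake_load` and `fake_process` are placeholders (a short sleep standing in for a network or disk fetch), not anything from the Prime Video system:

```python
import time


def profile(records, load, process):
    """Return (io_seconds, compute_seconds) summed over all records."""
    io_s = 0.0
    compute_s = 0.0
    for rec in records:
        t0 = time.perf_counter()
        data = load(rec)       # IO: fetch the bytes for this record
        t1 = time.perf_counter()
        process(data)          # compute: the actual processing job
        t2 = time.perf_counter()
        io_s += t1 - t0
        compute_s += t2 - t1
    return io_s, compute_s


def fake_load(rec):
    time.sleep(0.002)          # pretend this is a network or disk read
    return bytes(1024)


def fake_process(data):
    return sum(data)


io_s, compute_s = profile(range(20), fake_load, fake_process)
share = io_s / (io_s + compute_s)
print(f"IO {io_s:.3f}s, compute {compute_s:.3f}s, IO share {share:.0%}")
```

Swap in the real load and process steps and the split tells you whether a faster algorithm or less data movement is the thing worth paying for.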
To be fair it is somewhat a punching bag example but I think what people are reacting to, but maybe not articulating well, is the presumption for microservices by the powers-that-be.
Of course the only rational take on monoliths versus microservices is "use the right tool for the job".
But systems design interviews, FAANG, 'thought leaders', etc basically ignore this nuance in favour of something like the following.
Question: design pastebin (edit, I of course mean a URL shortener not pastebin)
Rational first pass but wrong Answer: Have a monolith that chucks the URL in the database.
Whereas the only winning answer is going to have a bunch of services, separate persistence and caching, a CDN, load balancing, replicas, probably a DNS and a service mesh chucked in for good measure.
I think this article shows that this is training and producing people who can't even think of the obvious first answer because they have been so thoroughly indoctrinated.
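For what it's worth, the "rational first pass" really is about this small. A sketch under obvious assumptions: SQLite as the database, row ids turned into base-62 codes, and a hypothetical `Shortener` class with no HTTP layer, auth, or abuse handling.

```python
import sqlite3
import string

ALPHABET = string.digits + string.ascii_letters  # 62 symbols


def encode(n: int) -> str:
    """Turn a row id into a short base-62 code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))


def decode(code: str) -> int:
    """Turn a base-62 code back into a row id."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


class Shortener:
    """The monolith: one process, one table."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls "
            "(id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
        )

    def shorten(self, url: str) -> str:
        cur = self.db.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        self.db.commit()
        return encode(cur.lastrowid)

    def resolve(self, code: str):
        row = self.db.execute(
            "SELECT url FROM urls WHERE id = ?", (decode(code),)
        ).fetchone()
        return row[0] if row else None


s = Shortener()
code = s.shorten("https://example.com/a/very/long/path")
print(code, "->", s.resolve(code))
```

Caching, replicas and a CDN can all be bolted on later if the traffic ever shows up; none of them change this core.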
I think the realtime requirement removes hadoop as an option. They might have considered using HDFS as the data store instead of S3, since putting lots of objects into s3 is expensive. Or just using a big EFS volume instead of S3.
It would be nice to know how much latency there was in the microservice version vs the monolithic version.
You never get "realtime" in data processing. Actual realtime systems are a totally different animal. Mostly done in the embedded space, the design of a realtime processing system involves setting up fixed time windows for each task that needs compute time and optimizing the code for each task until it fits into the time window for it, on every execution, every time. This is done in order to provide hard guarantees on how fast a system can respond to new data flowing in. It's usually only safety critical systems that actually have such responsiveness and delivery time constraints.
I point this out because how we talk about a problem determines what solutions we even acknowledge as being on the table here. Saying it's a realtime system when it isn't, or thinking we need realtime processing when we don't, makes people throw out solutions prematurely, and the thrown-out solutions are often the right answers.
Once you acknowledge that your system will not be "realtime" and you actually don't have the time-boxing and specific time window delivery constraints that actual realtime problem spaces have, you can weigh all of your actual options with an eye for what will be fastest and most efficient given the budget and hardware you have to throw at this problem.
This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".
Some excerpts:
> This eliminated the need for the S3 bucket as the intermediate storage for video frames because our data transfer now happened in the memory.
My candid reaction: Seriously? WTF?
I am honestly surprised that someone thought it was a good idea to shuffle video frames over the wire to S3 and then back down to run some buffer computations. Fixing the problem and then calling it a win?
But I think I understand what might have led to this. At AWS, there is an emphasis on using their own services. So when use cases that don't fit well on top of AWS services come up, there is internal pressure to shoehorn it anyway. Hence these sorts of decisions.
This is what L6 and L7 are building at Amazon, meanwhile in sys design interviews I’m being asked to design solutions for a gaming platform with 50M concurrent users.
> This is not a discussion of monolith vs serverless. This is some terrible engineering all over that was "fixed".
I feel that's like 95% of the "we migrated from X to Y and now it is better" stories: most of the improvement comes from rewriting the app/infrastructure after learning the lessons, with only a small part, sometimes, being the change in tech.
I wouldn't be surprised if the actual story underneath was that they got to a "works well enough" implementation and then forgot about the inefficiencies until someone looked at costs, connected the dots, and went "ok yeah we need to optimize this architecture."
I've seen some staggering cost savings realized because someone happened to notice that an inefficient implementation that wasn't a problem two years ago at the scale it was running at back then did not age well to the 10x volume it was handling two years later. The reason it hadn't fallen over was that horizontal scaling features built into the cloud products were able to keep it running with minimal attention from the SREs.
To the contrary, from my time at Amazon, I felt that developers want to use more high level AWS services. Unfortunately, the landscape of AWS services is so rapidly evolving that Amazon engineers themselves can't keep up and end up using the wrong service.
As mentioned in other comments, there are options such as Fargate, that would still technically be "serverless" and still yield similar cost reductions. Not to mention that AWS also has Step Functions Express for "on host orchestration" use cases. This seems like a case where the original architecture wasn't very well researched and neither was the new one.
It’s still all Amazon, the single publicly traded company. Legal shenanigans/optimizations don’t change that. The other commenter was referring to AWS the org over Amazon Retail or Devices (other orgs).
Amazon Retail and AWS are the same legal entity for stocks, but other than that they might as well be separate companies.
Retail uses AWS with all the same APIs and quirks as any other company. The only thing different is the negotiation on price (which many large companies also do).
Meanwhile, AWS is apathetic towards feature requests from Retail, and especially operational support for Retail.
In many ways Retail would be better off if it was a separate company and could threaten AWS with a multi-cloud diversification play.
GECKO all the way. :) I think AWS gave a reasonable price to Retail. The migration caused the biggest outage of the website but at the end there was some pretty nice cost saving on the YOY infra cost.
I worked on both sides. I mostly agree except there are cases of important projects including AWS (like some of the ML work), also the whole aws usage discount/pricing thing is pretty huge and clearly the value in being within the same company. Retail would have a pretty hard time existing nowadays if they weren’t connected to aws imho.
Don't. There's no benefit to using metal as opposed to the largest virt (which will take up the entire server anyways) pretty much. Metal just tends to be somewhat less reliable. Source: I work here.
Sure, mine was a tongue-in-cheek comment, but there are cost benefits of bare metal in some use cases, especially if your workload is more or less predictable.
The title is editorialised to be clickbait. The original title is "Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%".
They changed a single service, the Prime Video audio/video monitoring service, from a few Lambda and Step Function components into a 'monolith'. This monolith is still one of presumably many services within Prime Video.
The subtitle is "The move from a distributed microservices architecture to a monolith application helped achieve higher scale, resilience, and reduce costs."
And the article itself mentions the 90% cost reduction.
So the title seems pretty much in-line with the original intent.
Prime Video has hundreds of teams, VQA is a tiny team that owns a very specific QA service. Omitting that distinction from the title absolutely is clickbait.
I wish this was a good condemnation of microservices in a general use case but it is very specific to the task at hand.
Honestly, the original architecture was insane though. They needed to monitor encoding quality for video streams so they decided to save each encoded video frame as a separate image on S3 and pass it around to various machines for processing.
That is a massive data explosion and very inefficient. It makes a lot more sense that they now look for defects directly on the machines that are encoding the video.
Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.
> Another architecture that would work is to stream the encoded video from the encoding machines to other machines to decode and inspect. That would work as well. And again avoid the inefficiencies with saving and passing around individual images.
No, that’s still a bad architecture. Bandwidth within AWS may be “free” within the same AZ, but it’s very limited. Until you get to very very large instance types, you max out at 30 Gbps instance networking, and even the largest types only hit 200 Gbps. A single 1080p uncompressed stream is 3 Gbps or so. There is no way you can effectively use any of the large M7g instances to decode and stream uncompressed video. (Maybe the very smallest, but that has its own issues.)
In contrast, if you decode and process the data on the same machine, you can very easily fit enough buffers in memory, getting the full memory bandwidth, which is more like 1Tbps. If you can process partial frames so you never write whole frames to memory, you can live in cache for even more bandwidth and improved multi core scalability.
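For anyone who wants to check the "3 Gbps or so" figure, here is the back-of-the-envelope arithmetic. The 24 bits per pixel and 60 fps are my assumptions (the comment doesn't state them); the 30 and 200 Gbps link speeds are the ones quoted above.

```python
def uncompressed_gbps(width, height, bits_per_pixel, fps):
    """Raw video bandwidth in gigabits per second (no compression, no overhead)."""
    return width * height * bits_per_pixel * fps / 1e9

def streams_per_link(link_gbps, stream_gbps):
    """How many whole raw streams fit on a link, ignoring protocol overhead."""
    return int(link_gbps // stream_gbps)

# Assumed format: 1920x1080, 8-bit RGB (24 bits per pixel), 60 frames per second.
stream = uncompressed_gbps(1920, 1080, 24, 60)
print(f"one raw 1080p60 stream: {stream:.2f} Gbps")
print(f"streams on a 30 Gbps NIC:  {streams_per_link(30, stream)}")
print(f"streams on a 200 Gbps NIC: {streams_per_link(200, stream)}")
```

Under those assumptions a 30 Gbps instance tops out at about ten raw streams before it has done any actual work, which is why keeping decoded frames in local memory wins.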
Ah. I was thinking that the encoding machines were not bandwidth limited but rather cpu limited as they were doing expensive encoding algorithms. So I was thinking the streams were streaming out at less than real time. I figured this was better than the dual/multi encode method I think they are now relying upon when all the detection code doesn’t fit on the same machine as the encoder.
This is less an example of why serverless is bad and more an example of using unsuitable services for tasks they were not meant for.
In this case they were using AWS Step functions that are known to be expensive ($0.025 per 1,000 state transitions) and they wrote:
> Our service performed multiple state transitions for every second of the stream
Secondly, they were using large amounts of S3 requests to temporarily store and download each video frame which became a cost factor.
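To put rough numbers on the Step Functions part, here is a small sketch using the $0.025 per 1,000 transitions price quoted above. The article only says "multiple" transitions per second, so the 3 per second and the 1,000 concurrent streams are assumptions picked purely for illustration.

```python
PRICE_PER_1000_TRANSITIONS = 0.025  # USD, the figure quoted in the comment above

def step_function_cost_usd(transitions_per_second, stream_seconds, streams=1):
    """Estimated state-transition charge for monitoring `streams` streams."""
    transitions = transitions_per_second * stream_seconds * streams
    return transitions / 1000 * PRICE_PER_1000_TRANSITIONS

# Assumption: 3 transitions per second of stream.
per_stream_hour = step_function_cost_usd(3, 3600)
# Assumption: 1,000 streams monitored around the clock for one day.
per_day_fleet = step_function_cost_usd(3, 24 * 3600, streams=1000)

print(f"one stream, one hour: ${per_stream_hour:.2f}")
print(f"1,000 streams, one day: ${per_day_fleet:,.2f}")
```

This counts orchestration only; the per-frame S3 requests come on top of it.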
They had a hammer - and every problem looked like a nail. In my experience this happens to every developer at a certain stage when he/she gets in touch with a new technology; it doesn't mean that the tech itself is bad - it depends on the scenario, though.
Sending video frames between services is expensive, and paying per state transition for something that makes multiple state transitions per second in a single stream is also expensive...
Like, did they even think about cost when designing this the first time?
Yeah completely insane original design. A design I would expect from a first year intern who is just trying to make his first project work and is picking random technologies to string together.
Considering they don't actually pay the bill for this and it is internal accounting, probably not. Belt tightening has probably pushed cloud providers to figure out if they're wasting stuff they could put to better use, and I assume when it launched and nobody was watching Prime Video, inefficiencies were both smaller and less noticeable.
Oh, absolutely. This makes the Prime Video team look more profitable on paper. But also all streaming services were pretty much launched with an expectation of taking losses for years, so Prime Video being expensive doesn't look unusual for a while. And since it's an internal cost, it's not actually Amazon paying someone, there's really not a significant reason for someone outside the Prime Video team to say "hey, you're too big of an AWS customer".
More than likely, Prime Video making their numbers look better makes AWS' numbers look (slightly) worse, because they're doing a little less business. In the overarching grand scheme of things, this will save Amazon some amount of physical computing resources they weren't getting paid by an outside customer for, but good luck figuring out how much that actual real world savings is.
> The main scaling bottleneck in the architecture was the orchestration management that was implemented using AWS Step Functions. *Our service performed multiple state transitions for every second of the stream*(???), so we quickly reached account limits. Besides that, AWS Step Functions charges users per state transition.
This is so obvious in my head. I can't think of a single good reason where a SFN makes sense here.
I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so). That's the impression I have of Amazon's leadership but maybe I'm wrong.
> We designed our initial solution as a distributed system using serverless components (for example, AWS Step Functions or AWS Lambda), which was a good choice for building the service quickly.
The message seems more that they outgrew AWS lambda but that lambda was a good choice at first.
> The post literally says that they could hit only 5% of the expected workload with their server less architecture, so IMO it is still quite negative.
Emphasis on "their server less architecture". Sometimes good tools are used poorly.
For example they describe a high throughput workload, and each workload spread through a bunch of lambdas that handled bite size bits of the workflow. Also, they managed the workflow with step functions. Just imagine the number of network calls involved to run a single job, let alone all the work pulling data to/from a data store like S3 into/out of a lambda. I'd guess the bulk of their wall time was IO to set up the computation.
Of course you get far better performance if you get rid of all these interfaces.
Well, they do work for Amazon, so they can't say lambda sux. A monolith is way faster to develop, especially the CI/CD part, so no: if they had started with a monolith there would be no downside.
> I’d be surprised if this doesn’t get taken down as it casts AWS lambda in an unfavorable light
“There are use cases where Amazon EC2 and Amazon ECS are a better platform than AWS Lambda” is…not actually a message that anyone involved in AWS has ever been afraid to put forward.
I mean, the whole reason that AWS has a whole raft of different compute solutions is that, notionally, removing any one would make the offering less fit for some use case.
The solution was using a different array of AWS resources so I don't see how anything is being cast in a bad light. Lambda is great for many use cases.
> I'd be surprised if this doesn't get taken down as it casts AWS lambda in an unfavorable light (and rightly so).
The article mostly lays the blame on step functions. Also, lambdas are portrayed as event handlers that don't run very often. This means long running tasks that are run occasionally, or events that don't fire that often. Once throughput needs go up or your invocation frequency comes closer to the millisecond then the rule of thumb is that you are already requiring a dedicated service.
Indeed, it does seem rather ridiculous at face value. On the other hand, I have coworkers that run CPU-IPC bound workloads inside x86-64 docker containers on M1 macs (incurring the overhead of both machine code emulation and OS virtualization). I have other coworkers sweating for hours whether to use 32-bit or 64-bit integers for APIs designed for microcontrollers running at 300 MHz. I have even more coworkers writing stuff in rust because it's "memory safe" and "so fast", but they have no idea that they're doing thousands of unnecessary heap memory allocations per second when I naively start asking questions in a code review.
Even really smart, capable people in general have really poorly calibrated intuition when it comes to the intrinsic overhead of software. It's a testament to the raw computational power of modern hardware I guess. In the case of AWS, it's never been easier to accidentally a million dollars a month.
> AWS Step Functions charges users per state transition
Apparently they didn’t know about the EXPRESS execution model, or the much improved Map state. The story seems to be one of failing to do the math and design for constraints rather than an indictment of serverless.
I have to agree with others - it is amazing this article saw the light of day.
Over 15 years ago now, I was an intern at Toyota. We were working with an in-house python based framework for doing cool/terrible drive-by-wire things with test cars.
I had a project to work around a bottleneck of the framework. It could only process about 70 CAN frames per second before running out of CPU. The vehicle's CAN bus had several thousand per second, though. At the time I was able to fix the problem by adding filtering to the CAN adapter's kernel module.
A couple years later, I worked on replacing the python based framework with C++. I discovered the underlying root cause of the bottleneck. Someone (cough my manager) had figured out a very "pythonic" way to extract bit-packed fields from the 64-bit CAN frame payloads. They converted every 8-byte payload buffer into a canonical binary representation, i.e. ascii strings of 1's and 0's. They then used string slicing syntax to extract fields. Finally, they cast the resulting substrings back to integers. Awesome!
I've since used python many times to process CAN frames in realtime, scaling up to thousands of frames per second without the CPU breaking a sweat. One trick is to use integer bit shifts and masks rather than string printing, slicing and parsing...
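For the curious, the two approaches look roughly like this. This is a sketch, not the original framework code: the function names are mine, and I'm assuming big-endian, MSB-first bit numbering (bit 0 is the top bit of the first payload byte), which real CAN signal layouts don't always follow.

```python
def extract_field_strings(payload: bytes, start_bit: int, length: int) -> int:
    """The slow way: render the payload as a string of 1s and 0s, slice, re-parse."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    return int(bits[start_bit:start_bit + length], 2)

def extract_field_shifts(payload: bytes, start_bit: int, length: int) -> int:
    """The fast way: one integer conversion, one shift, one mask."""
    value = int.from_bytes(payload, "big")
    shift = len(payload) * 8 - start_bit - length
    return (value >> shift) & ((1 << length) - 1)

# Example 8-byte payload; the field layout (12 bits starting at bit 8) is made up.
payload = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
slow = extract_field_strings(payload, start_bit=8, length=12)
fast = extract_field_shifts(payload, start_bit=8, length=12)
print(hex(slow), hex(fast))
```

Both return the same value; the second just skips building and parsing a 64-character string for every field of every frame.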
Horribly inefficient code is a wonderful thing at a small scale. The faster you solve your problem, the sooner you can solve the next problem.
I once threw together a mylar balloon helium blimp in the shape of a Dragon space capsule. My goal was to fly it over the cafeteria crowd at SpaceX during the C2 launch. For control, I used the PCB of a travel wifi router. I soldered three small DC motors to its LED outputs. The embedded software consisted of something like:
nc -l -u -p 10000 | bash
I then connected my laptop to the access point and ran a python script that would send UDP packets containing shell commands to toggle the LED GPIO pins based on arrow keypresses.
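The laptop side was something along these lines. This is a reconstruction from memory, not the actual script: the LED names, the sysfs paths and the router address are hypothetical, and only port 10000 comes from the one-liner above. Obviously piping UDP into bash is only acceptable on a throwaway blimp.

```python
import socket

# Hypothetical mapping from a key name to the shell command the router runs.
# The /sys/class/leds/<name>/brightness layout is the usual Linux LED sysfs
# interface; the LED names here are invented for illustration.
KEY_TO_COMMAND = {
    "up": "echo 1 > /sys/class/leds/motor_up/brightness",
    "left": "echo 1 > /sys/class/leds/motor_left/brightness",
    "right": "echo 1 > /sys/class/leds/motor_right/brightness",
    "stop": "for led in /sys/class/leds/motor_*; do echo 0 > $led/brightness; done",
}

def build_packet(key: str) -> bytes:
    """Encode the shell command for a key as one newline-terminated datagram."""
    return (KEY_TO_COMMAND[key] + "\n").encode("ascii")

def send_key(sock: socket.socket, address, key: str) -> None:
    """Send the command to the listener (nc -l -u -p 10000 | bash)."""
    sock.sendto(build_packet(key), address)

# Usage (address is hypothetical):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_key(sock, ("192.168.1.1", 10000), "up")
```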
The crowd really enjoyed the novelty. After the excitement was over, I flew it around some more in the cafeteria. Elon Musk walked up to it floating in the air, paused for a few seconds, then looked around the room trying to find the operator. I was just like any other employee hanging out at a table casually typing on my laptop, though.
Good times. On my last day there I still had a helium tank under my desk. So, I filled up a life-sized Elmo balloon (a left over prototype), then let it float up into the rafters of the office. It was presumably up there for a month or two.
> For control, I used the PCB of a travel wifi router. I soldered three small DC motors to its LED outputs. The embedded software consisted of something like:
> nc -l -u -p 10000 | bash
That's a neat idea. Did you have to flash it with a custom firmware or do they typically come with netcat etc installed?
You would probably do a little bit of research after seeing the performance of your code. It's one thing to code the prototype sloppily, it's another to push it to prod.
Lots of opportunities short of rearchitecting: use batching; use multi threading in lambdas; use S3 range requests; use the EXPRESS execution model; etc, etc
Dead horse and all that but please just stick to Boring Tech, it is better for your mental health, not to mention your business, development velocity, defect rate, etc.
Most importantly it's good for mental health though.
Microservices are just an architectural pattern, and like all patterns there are places where they are highly appropriate, and others where they are inappropriate.
Same for cloud, same for <pattern>
If everything is a hammer you'll hurt your thumb/hand/arm.
At least now (for some time) the pattern is named, so broadly when talking about this sort of thing, the name conjures up the same/similar image in everyone's heads.
There are all sorts of inputs to the choice of architectural patterns, including budget, scalability (up and down), criticality, security, secrecy, team skills and knowledge, preference, organisational layout, organisation size, vendor landscape, existing contracts, legal jurisdiction ....
As someone who has never been an AWS employee, I can almost guarantee you the original design was most likely simple, and the use of lambdas and step functions was a good choice and not expensive, but the functionality grew and the cost skyrocketed. This is just the normal evolution of a service.
Clickbait title. The expensive part was passing around individual frames and the associated S3 operations. It's not clear if they could've kept a distributed architecture but made the work units be chunks of frames or even whole videos. Monoliths can inefficiently use S3 and other cloud services to rack up a huge bill.
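A quick sketch of why the work-unit size matters so much. I'm only counting requests, not pricing them, and the frame rate, detector count and chunk size are assumptions for illustration: each stored object costs one upload plus one download per detector.

```python
import math

def s3_requests(total_frames: int, frames_per_object: int, detectors: int) -> int:
    """Requests needed if every object is uploaded once and read by each detector."""
    objects = math.ceil(total_frames / frames_per_object)
    return objects * (1 + detectors)

# Assumptions: 30 fps, one hour of video, 3 defect detectors.
frames = 30 * 3600
per_frame = s3_requests(frames, frames_per_object=1, detectors=3)
per_chunk = s3_requests(frames, frames_per_object=300, detectors=3)  # 10 s chunks

print(f"one object per frame:   {per_frame:,} requests per stream-hour")
print(f"one object per 10 s:    {per_chunk:,} requests per stream-hour")
```

Under these assumptions chunking alone cuts the request count by a factor of 300 without touching the distributed design, which is why the headline framing seems off.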
I don't want to come off too harsh on this, but it sounds like the service didn't meet the initial design requirements?
Some of this would have been really easy to predict (eg. hitting account limits) if they simply took the time to calculate how many workflow transitions they'd need to execute for the load.
If you came to me with a design that included passing individual video frames through S3 instead of RAM I would honestly think you were joking. What a wild article.
I’m all for big, fast, monoliths - but I’m not sure I want to hear it from the team that saved video frames to s3 in their AWS Step Function video encoder.
AWS is truly a customer first company. I was an AWS customer in its early days (2006-2012) and again recently (2022-now). And they have been consistent in being customer-first. In the last year, they have proactively helped us cut our AWS spend by multiples. I'm not surprised at all by this article coming from within Amazon. Kudos for maintaining such a culture.
The headline is a bit of a misnomer. This happens in large businesses all the time (which isn't to say it's "good", hardly is, but it suggests the causation is incorrect here, which then indicates the conclusion is entirely off-base):
1) We have sexy new product! Everyone use it so we have some use-case stories to tell and we look credible! Who cares if it's not the right tool for the job! We need a splashy way to use hackneyed business speak like "we're eating our own dog food" at the next user con so all the IT middle managers there will fight over early access and adoption. PROFIT! (Screams of technology teams in the background of "a knife is the most expensive, useless pry tool you can buy, but whatever, you are not listening, mmmkay").
2) A few quarters/years later (if you're lucky and you made it or someone with enough gravity in their title finally saw the light): Why is expense so high in this business unit? This is insane! Let's go back to a more sane architecture. (Screams of technology teams going back to what was working in the first place, but was not sexy nor necessarily new now that no one is watching and hype cycle is over)
Does this mean that serverless is useless? Dumb? Uneconomical? No way. For bursty, very short running workloads, it can be GREAT and INCREDIBLY economical.
What is useless and "dumb" is whoever thought that Prime Video's encoding workloads were going to do anything but increase cost and were somehow a fit for a system whose business case specifically necessitates bursty, shorter workloads that are primarily scale-to-zero for significant periods of the day/week/month.
It was a marketing stunt gone horribly wrong: intentional or not, but that doesn't repudiate the value of "serverless" for the right workloads, it just proves you better really understand the technology and the business case and the scale economics, and that goes for any technology.
amazon product ditches amazon product for another amazon product?
feels very strongly they just moved from one AWS platform to another.
delay between asynchronous communicating processes differs in these architectures and I suspect they were unable to orchestrate microservices to match the RPC "inside" a monolith model. Nobody can: It only matters if your IPC is causing delay you can avoid.
Most of us aren't in a room where the real cost is high: 90% of computers are more than 90% idle 90% of the time. Amazon is not in that cohort.
I think what most people are missing here is that they used AWS Step Functions in the wrong place. Part of the blame here is that, in its over-enthusiasm to get more users, AWS doesn't properly educate customers on when to use which service. Worse, for each use case AWS has dozens of options, making the choice incredibly hard.
In this case, they probably should have used Step Functions Express, which charges based on duration as opposed to number of transitions, since what they're looking for is "on host orchestration": orchestrating a bunch of things that each finish quickly and are repeated over and over many times. Standard Step Functions is better when workflows run longer and exactly-once semantics are needed. Link for reading differences between Express & standard step functions: https://docs.aws.amazon.com/en_us/step-functions/latest/dg/c....
This also exemplifies something I learned while at Amazon & AWS: Amazon themselves don't know how best to use AWS. This is one of the great examples. I'll share 1 more:
- In my team within AWS, we were building a new service, and someone proposed to build a whole new micro service to monitor the progress of requests to ensure we don't drop requests. As soon as I mentioned the visibility timeout in SQS queues, the whole need for the service went away. Saving Amazon money ($$) & time (also $$). But if I or someone else hadn't mentioned it, we would have built it.
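For readers who haven't used it, the visibility timeout pattern looks roughly like this. It's a sketch, not our actual service: `process_one`, the handler and the timeout values are made up, while `receive_message` and `delete_message` are the standard boto3 SQS client calls. The client is passed in so the function can be exercised with a stub.

```python
def process_one(sqs, queue_url, handler, visibility_timeout=60):
    """Receive one message, process it, and delete it only on success.

    While the handler runs, the message is hidden from other consumers for
    `visibility_timeout` seconds. If the handler raises or the worker dies,
    the message is never deleted and SQS makes it visible again after the
    timeout, so no separate "did we drop a request?" service is needed.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        VisibilityTimeout=visibility_timeout,
        WaitTimeSeconds=10,  # long polling
    )
    messages = response.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    handler(message["Body"])  # an exception here skips the delete below
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True

# Usage with a real client (queue URL is a placeholder):
# import boto3
# process_one(boto3.client("sqs"), "https://sqs.../my-queue", print)
```

The caveat is that this gives at-least-once delivery, so the handler has to tolerate seeing the same message twice.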
I don't think serverless is a silver bullet, but I don't think this is a great example of when not to use serverless. It helps to know the differences between various services and when to use what.
PS: Ex Amazon & AWS here. I have nothing to gain or lose by AWS usage going up or down. I'm currently using a serverless architecture for my new startup which may bias my opinions here.
Worth mentioning as mentioned in other comments that moving video data around at that scale was a bad choice to begin with. They could have considered fargate and avoided moving the data around so much as well and realized similar reductions in cost. So the wins are not really coming from moving to monolith as much as they're coming from optimizing unnecessary data transfers.
If the article said fargate, which is technically still serverless we could have avoided a whole microservice vs monolith debate or serverless vs hosts/instances debate.
I work in streaming video, specialise on AWS, and have enjoyed using Step Functions for certain (non-video) projects. I am _astonished_ that Step Functions + S3 was even considered as a starting point for defect detection in streaming video. Astonished.
> Moving the solution to Amazon EC2 and Amazon ECS also allowed us to use the Amazon EC2 compute saving plans that will help drive costs down even further.
So various parts of Amazon have to work through the same AWS pricing programs that the rest of us do?
This. All trends are cyclical. Microservices have a purpose. Monoliths have a purpose. They are not mutually exclusive. One is the path to the other but there may also be resets along the way. I spent 10 years doing microservices and now I'm back to a monolith. It's a refreshing change but it's also a project in its infancy. Breaking that out over time will only happen as and when needed.
If your architecture has a high cost to develop, test and run when a cheaper architecture meets your needs, it's a sign that you have overengineered. In my experience there is an order-of-magnitude increase in complexity by adopting microservices that only starts to pay off when your org and user base are huge.
I think that would be done via code creation unless the function needs LLM qualities. But LLM TDD where both the tests and code are autogenerated could be a thing for sure. And it will be microservices so that each service is easy to generate by LLM!
> ChatGPT: Yes, Avogadro's number is even. The value of Avogadro's number is approximately 6.022 x 10^23, and since it ends with the digit 2, it is an even number.
Somewhat interesting article, but this isn't a monolith, at least not by a microservice fanboy definition.
The product (Prime Video) is still built using many business oriented services. Furthermore, this service appears to be developed and operated by a single team.
That being said, there are some lessons here - there are good ideas in most design paradigms, but if you take them to the extreme, you're going to see some weird side effects. Understand the benefits and engineer a balanced solution.
I think serverless has its place, but this problem doesn't seem like a fantastic fit.
We are looking into serverless as a way to exhibit to our customers that we are strictly following certain pre-packaged compliance models. Cost & performance are a distant 2nd concern to security & compliance for us. And to be clear - we aren't necessarily talking about actual security - this is more about making a B2B client feel more secure by way of our standardized operating model.
The thinking goes something like - If we don't have direct access to any servers, hard drives or databases, there aren't any major audit points to discuss. Storage of PII is the hottest topic in our industry and we can sidestep entire aspects of The Auditor's main quest line by avoiding certain technology choices. If we decided to go with an on-prem setup and rack our own servers, we'd have to endure uncomfortable levels of compliance.
Put differently, if you want to achieve something like PCI-DSS or ITAR compliance without having to convert your [home] office into a SCIF, serverless can be a fantastic thing to consider.
If performance & cost are the primary considerations and you don't have auditors breathing down your neck, maybe stick with simpler tech.
Overall, like it's stated in the article, it would be a case-by-case choice what to use.
My experience tells me it's always a good idea to start with the monolith but I don't know much about PII to tell you your idea is over-engineered.
I feel there are better ways though.
Also, you don't need to use Lambda to avoid being on-prem; EC2 is enough.
I am big fan of django's apps model ... what I like to call a "Modular Monolith".
Being an early engineer at most of my stints, I have built and scaled multiple startups using the approach and it has never failed me; the pitfalls of micro-services are not worth it unless absolutely necessary.
I always made it a point to group by business-logic rather than separate at whatever curve ball "new-tech" throws at me.
Two naive ideas that may be OK as a going-in position:
- granularity
- bandwidth negligibility
Breaking everything down to a gnat's ass might improve testability, but is testability the product? Do I really need a Java stack trace that reads like an Andrew Wiles proof?[1] Maybe I do, at scale.
Then there is the non-zero cost of the packet shuffling. Every edge in the architectural graph, not just the nodes, costs. But we just throw a waiter into the code and move on to the next line. No biggie.
What was most interesting was "It also increased our scaling capabilities." Granularity was supposed to let "serverless" absorb the entire universe, I thought.
At a higher level of abstraction, maybe The Famous Article is a map/reduce job: the requirements dissolved into solution, and a proper number of components precipitated out.
> The second cost problem we discovered was about the way we were passing video frames (images) around different components. To reduce computationally expensive video conversion jobs, we built a microservice that splits videos into frames and temporarily uploads images to an Amazon Simple Storage Service (Amazon S3) bucket. Defect detectors (where each of them also runs as a separate microservice) then download images and processed it concurrently using AWS Lambda. However, the high number of Tier-1 calls to the S3 bucket was expensive.
Taking "malloc for the Internet" [1] a bit /too/ literally there.
Seems somewhat curious that they didn't at least include Fargate. Feels like they jumped all the way from the typical overengineered setup into using AWS in a way that's very close to just "I need virtual machines".
I've never seen successful micro services if the starting point is not a monolith. The most successful ones I've seen are hybrid ones where some parts needed to be scaled are refactored as a micro service to run in parallel.
Bang on. A friend I work with used to say "microservices are for scaling teams, not tech" which I liked.
Even with monolith -> microservices I've seen it go wrong. One Go application I worked on it would take a senior engineer a week to add a basic CRUD endpoint as the code had been split in to microservices along the wrong boundaries. There was a ridiculous amount of wiring up and service to service calls that needed done. I remember suggesting a monolith might be more appropriate, and was told it used to be a monolith but had been "refactored to microservices"...
This type of stuff can literally kill early stage companies.
Shipping around individual video frames between components is really an astonishingly bad idea.
Microservices seem to be a decent idea with a terrible name. The idea of running services that are small enough that they can be managed by a single team makes sense - it enables each team to deploy their own stuff.
But if you break things down further, where you need multiple "services" to perform a single task, and you have a single team managing multiple services - all you do is increase operational & computational overhead.
The only time it makes sense to use edge/serverless anything is lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That's the only use case good for edge. And any supporting infra that can help deliver rendered pages asap (like kv store on the edge for storing sessions, lightweight database on the edge for user profile data, queues etc). Anything that requires decent amount of processing should not live on the edge/serverless. It defeats the purpose.
> The only time it makes sense to use edge/serverless anything is lightweight APIs and rendering HTML to end users so they get the page loaded as quickly as possible. That’s the only use case good for edge.
Nope. Edge is just serverless that is closer to your user to reduce the number of network hops. Both are essentially the same when it comes to technical functionality. They run on limited resources and should not be used for intensive workloads.
You are just getting unnecessarily pedantic here. I was talking about computing resource usage. Both are the same when it comes to resource consumption being limited.
It is like saying Oracle/Postgres/MySQL/MSSQL. If you say they are different in some X functionality, yeah duh they are different in X functionality. However, they are all SQL databases.
In the same way, edge and serverless both run on limited compute resources (which is the point of the article and the point I was making). That they differ in functionality X (latency/closeness to your user) has nothing to do with either the article or my answer.
I'll launch a consulting business focused on migrations from microservices to monoliths and from the cloud to in-house. Pricing would be a % of the saving over the first year.
I'm happy to see, in the discussion here, the continued backlash against microservices and the deleterious effects they have had on software complexity and data modelling.
But I think it's interesting that if we took a time machine back to 2014 or 2015 the tone here would be quite different, and microservices were all the rage on this forum as I recall.
I like to hope that the industry learns from its failed trends, but I'm now old enough to see this is rarely the case.
These days when project managers of new products seek my advice as a solutions architect I tend to suggest they create a minimally viable product that is written modularly so it can scale, but deploy it very simply on a few servers just like we used to 15 years ago.
Scaling is definitely a good thing, microservices make scaling easier, no doubt about that. But an MVP rarely needs k8s level scaling, it just needs to be written well so it can scale in the future.
I've been having lots of thoughts lately about how you build a system that a) can respond to scale, b) runs for the most affordable price possible, and c) scales infrastructure spend with income.
I love the anecdotes about just buying a Hetzner server which can handle a surprising amount.
One of my ideas is a company that maintains an incremental infrastructure that can grow to handle extreme levels of traffic - the infrastructure itself mutates over time.
Breaking things into tiny functions and putting them on many different servers incurs tradeoff costs in both complexity and compute. There is a complexity cost in having to deal with the setup, security, and orchestration of those functions, and a compute cost because if the overall system is running constantly it will be less efficient and therefore more expensive than running on one box.
Good point. "communication" should also be on the list. I don't think storage is technically the tradeoff in this case even though it's S3. It's the traffic between those components that's costing them.
Rarely will I defend Amazon in anything, but I'll make an exception.
In my experience, AWS/Amazon people do not force you or even direct you to a particular architectural choice. They are relatively indifferent about it.
Instead, trend-driven architectures seem to come from the tech community themselves. It's the customers often making the wrong choice.
Perhaps Amazon reached peak saturation for its video streaming services, so it no longer had unknown unknowns holding it back from using a more efficient monolithic architecture. Distributing services across multiple machines is certainly more scalable, but all those API calls can add up.
Over-engineering at its best. I tend to see microservices as a double-edged sword, and in this case there was no need for them.
Also, the pricing of AWS quickly goes up as you go from EC2 -> Fargate -> Lambda. I don't know why on earth someone would build microservices at the lambda-level.
They basically underestimated the cost of moving millions of small files to and from S3; it kinda makes sense if they want to save those images for a long time, but in this case it was for semi-real-time error detection, which is much faster to do in-memory.
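To put a rough shape on "millions of small files", here is a back-of-the-envelope sketch in Python. Every number in it (frame rate, stream count, readers per frame) is a made-up assumption for illustration, not a figure from the article, and `storage_requests_per_day` is a hypothetical helper. It only counts object-store round trips; it says nothing about actual pricing.

```python
SECONDS_PER_DAY = 86_400


def storage_requests_per_day(frames_per_second: int, streams: int, readers_per_frame: int) -> int:
    """Count object-store requests when every frame is written once and then
    read back by each downstream component (all inputs are hypothetical)."""
    writes = frames_per_second * streams * SECONDS_PER_DAY
    reads = writes * readers_per_frame
    return writes + reads


# Assumed workload: 1 frame/s sampled from 100 streams, 2 components reading each frame.
via_storage = storage_requests_per_day(frames_per_second=1, streams=100, readers_per_frame=2)
print(via_storage)  # 25920000 requests per day

# If the same components share a process, the frame is passed by reference
# and the object-store request count for this hand-off is simply zero.
```

Even with these modest assumed inputs the hand-off alone is tens of millions of requests a day, which is the kind of thing that is easy to underestimate up front.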
I wonder if this is, in some way, a kind of signalling of where AWS wants to go – maybe they want to shift more towards dedicated hosting rather than all of these separate services?
Of all the streaming services that have irritated me, I can't recall any serious technical problems with Prime. I suppose I have a vague memory of poor AV sync that could have been on Prime; it was always a problem at the start of streaming that would work itself out after a few seconds.
Netflix's shiny new compression scheme a couple years ago didn't work on my Sony TV's buggy silicon. The only way I got that fixed was by knowing someone on the inside.
Hulu usually can't make it through an episode without the video freezing at least once. Sometimes it just refuses to work at all until I completely reboot the TV.
HBO Max's UI is just really cheesy and slow, but whatever it's fine.
Paramount+ is my new favorite to hate on. The UI is maddeningly glitchy and lethargic. I pay for no ads, but it plays ads anyway, on Star Trek episodes from 1996. It doesn't remember progress in a show more than once every week or two, just enough to remind you that it's supposed to be a feature. On my phone, it doesn't hide the typical menu overlays unless I do a complex sequence of finger taps. One time I tried to file a bug report from inside the logged-into app, and I got an email back claiming that they would love to consider my concerns but can't because they don't have an account associated with my email address.
For me, the sync is fine at the start of playback on PrimeVideo. It just gets progressively worse, which leads me to believe they have used a video framerate that is ever so slightly different from the source and insert keyframes at longer than optimal intervals. Similarly, a sample rate mismatch between the output audio and the input audio stream could be a potential cause.
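To show why a tiny framerate mismatch produces drift that grows over time, here is a small Python sketch. The 24000/1001 vs 24 fps pair is just a common example of two nearly identical rates; I am not claiming those are the rates PrimeVideo uses, and `av_drift_seconds` is a hypothetical helper.

```python
from fractions import Fraction


def av_drift_seconds(source_fps, playback_fps, elapsed_seconds):
    """Seconds by which video runs ahead of (+) or behind (-) the audio after
    elapsed_seconds of playback, if frames authored at source_fps are shown
    at playback_fps while the audio plays at its correct speed."""
    ratio = Fraction(playback_fps) / Fraction(source_fps)
    return Fraction(elapsed_seconds) * (ratio - 1)


ntsc_film = Fraction(24000, 1001)  # roughly 23.976 fps

# The drift is proportional to elapsed time: barely noticeable after a minute,
# several seconds after an hour.
print(float(av_drift_seconds(ntsc_film, 24, 60)))    # 0.06
print(float(av_drift_seconds(ntsc_film, 24, 3600)))  # 3.6
```

The same arithmetic applies to an audio sample rate mismatch: any constant ratio error gives an offset that grows linearly with playback time.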
And I use a FireStick, FWIW.
BTW, their own transcoder product MediaConvert seems to have this issue (though it could also be user error in how they have used the product or set up the parameters). [1]
My guess is PrimeVideo dogfoods MediaConvert and they also have this issue. They could have fixed it for newer content, but previously transcoded content still has issues (which will remain until they are re-transcoded).
I guess what AWS sells is not servers, but software to manage them automatically, to load balance, to replicate etc. Once, in a short time, GPT can write such (pretty standard) software for you, Amazon will, too, go down.
You're vastly oversimplifying this, imho. It's not just being able to write something and get AI to write terraform for you (it doesn't do it all that well atm in reality, for anything complex). You can't automate the people who you need to convince to make those decisions internally, on the whole, at least :)
I wouldn't call it a monolith as the number of instances could be scaled up. Mono implies single instance. They just combined multiple microservices into a larger one.
To most people, "mono" refers to a single codebase, not a single deployed instance. I've worked on many monoliths that run multiple instances in production.
Microservices are no more or less scalable than a monolith. The main benefit of microservices is allowing multiple teams to work independently from each other without everyone "stepping on each other's toes". You can have scalable monoliths and unscalable microservices.
> Microservices are no more or less scalable than a monolith.
This is not fully true. A microservice architecture is more finely scalable than a monolith.
To take a very basic example, if you have a peak of users watching a video you can scale up the microservice dedicated to serving videos, but not scale up the service dedicated to users signups, which isn't having an increased load.
> you can scale up the microservice dedicated to serving videos, but not scale up the service dedicated to users signups, which isn't having an increased load.
No, splitting a codebase does not magically make it more scalable in production. You still have to prove that the authentication component would create significant unnecessary load if it was scaled up together with the video service.
> A microservice architecture is more finely scalable than a monolith.
Apologies, but I strongly disagree and I'm going to go on a bit of a rant here....
This is a myth, and one of the reasons people are making these ridiculous architecture decisions. If you have a monolith that serves videos and enables signups, you can deploy as many instances of that as you like based on the highest need. It doesn't matter if user signups are a fraction of video watches, it just means that your user signup endpoint is not getting called as much. Maybe you're deploying a larger codebase than you need to but that's hardly a downside.
In your example, let's say we have 2 endpoints that are behind a gateway or L7 LB so that we can point them at different codebases if we like:
- videoservice.com/signup
- videoservice.com/watch
If I'm getting 100k rps to /watch, and 100 rps to /signup, I can just deploy loads of instances of my monolith behind the /watch endpoint. Maybe that monolith contains code for /signup, but it's not going to get called. So what.
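A rough Python sketch of that setup, in case it helps. The pool names and the 500 rps per-instance capacity are made-up assumptions, and `pick_pool` / `plan` are hypothetical helpers standing in for whatever the gateway or L7 LB and autoscaler actually do. The point is only that both pools run the same monolith build and are sized independently.

```python
import math

# Path prefix -> pool of instances. Both pools run the SAME monolith build;
# the pool names are hypothetical.
ROUTES = {
    "/watch": "watch-pool",
    "/signup": "signup-pool",
}


def pick_pool(path: str) -> str:
    """Pick the pool for a request path, the way a gateway or L7 LB rule would."""
    for prefix, pool in ROUTES.items():
        if path.startswith(prefix):
            return pool
    raise LookupError(f"no route for {path}")


def plan(traffic_rps: dict, capacity_per_instance: int) -> dict:
    """Size each pool from its own traffic (capacity figure is an assumption)."""
    return {
        pick_pool(path): max(1, math.ceil(rps / capacity_per_instance))
        for path, rps in traffic_rps.items()
    }


# 100k rps to /watch, 100 rps to /signup, assuming 500 rps per instance.
print(plan({"/watch": 100_000, "/signup": 100}, capacity_per_instance=500))
# {'watch-pool': 200, 'signup-pool': 1}
```

No code was split to get there: the /signup handlers are still present in the 200 /watch instances, they just never receive traffic.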
I've seen this approach used in many places. You don't need to split the code to do this at all. Sure it might feel "cleaner" to you to do this, but it's not needed.
Now, you may get to a point where your deployment is really heavy and time consuming and you don't want to deploy everything just to scale up /watch - but again I'd argue that is not really anything to do with scalability, it's about being able to deploy things independently. Using a microservice doesn't make your service more scalable here, but it might make it easier to deploy.
Microservices are nothing to do with scalability. They are about how you organise code and teams to achieve better development velocity.
> Microservices are nothing to do with scalability. They are about how you organise code and teams to achieve better development velocity.
I don't think this is strictly true, even though microservices are usually used that way.
Scaling up everything even when not needed has its difficulties. You can have lots of unnecessary initialisation tasks, lots of unused caches warmed up, database and socket connections that are not needed, complexities in work sharing algorithms etc.
Thank you for clarifying. It seems my definition doesn't correspond to others'. Then do we lack a word for a monolithic application that handles everything and that can only have one instance running?
It’s also not really serverless to begin with, because at the end of the day code is being executed on a physical device that many of us might call a “server”