It amazes me they have 1,700 services. It would be hilarious satire to actually see the description of each. And the debugging scenarios he listed make it sound like they have very poor engineers working for them. Who on earth lets an application get to prod that iterates through a list of items and makes a request for each thing?
When did we lose our heads and decide such an architecture is sane? The UNIX philosophy is "do one thing and do it well," but that doesn't mean being foolish about the size of that one thing. Doing one thing means solving a problem, and limiting the scope of said problem so as to cap cognitive overhead, not adding another notch to your "I built a service" belt.
We don't see the ls command divided into 50 separate commands and git repos...
While I'd concede that 1,700 services is a lot, and probably a function of the fact that they're hiring too many engineers too quickly (there's no possible way to indoctrinate people into sane and standardized engineering practices at this sort of hiring growth), I don't actually think what he's describing is that unusual for a company at Uber's scale or experiencing Uber's rate of growth.
I work at Airbnb and while we have many fewer engineers and services than Uber does, most of the issues he talked about in the talk resonated with me to one degree or another.
Saying something like '...they have very poor engineers working for them' is pretty unfounded and naive. It's easy to say that having 1,700 services is overkill from our point of view, but we don't know the complete architecture, the problems being solved, or the environment that they operate in.
One thing to consider is that once you have set up the tooling for a platform (logging, request tracing, alerting, orchestration, common libraries, standards, and deployment), deploying and operating new services becomes relatively straightforward.
That said, 1,700 is a lot, which makes me intrigued to see inside Uber.
I'm currently working with a very (very) large UK-based retailer, with various outlets globally, and I can tell you from first-hand experience that their selection process is ruthless. There are no, or at least very few, lamers on the tech team.
And yet, when you look at the systems, and the way things are built, some of it just seems crazy. Only it's not, and for many reasons, of which here are only a couple:
- The business has grown and evolved over time - with significant restructuring occurring either gradually, or more quickly and in a more intentionally managed way; systems have grown and evolved with it, often organically (which is something I see at every company I've worked with).
- Legacy systems make it really hard to change things and migrate to newer and better ways of doing them, because they're still in production and depended upon by a host of other systems and services.
- The degree of interconnectedness between systems and services is high across the board; this isn't a reflection of bad design, so much as a reflection of the complexity of the business.
In any large organisation these kinds of things are apparent. I would imagine that in a large organisation that has become large very quickly, like Uber, if anything the problems would be magnified.
To say that they have poor engineers is therefore very unfair: these are people who would have needed to ship RIGHT NOW on a regular basis in order to facilitate the growth of the business. That's going to lead to technical compromises along the way, and there's little to be done to avoid it. There's also a difference between being aware of a problem and having the time, or it being a priority, to fix it (e.g., iterating and retrieving).
This wasn't a derivation from the repo count; it was in response to the debugging part of the talk where he was talking about fan-out. Some app was iterating through a list of things and making a network request for each one, instead of figuring out what it needed up front. It's either complete laziness or incompetence, probably the latter. I assume with any company that has that many services, their culture is partly to blame, too.
It's interesting because this kind of technical debt will eventually open the door to competitors once Uber is under more financial pressure, i.e. once they have burned through all the investor money.
I agree that it is utterly stupid. However, it is almost certainly a political / interpersonal / inter-team issue. Very probably, the person making the calls couldn't or didn't convince the team owning the service they were calling that a bulk access mechanism was required.
It probably doesn't help that canonical REST as usually applied to APIs doesn't really have a single well-known pattern for this.
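For what it's worth, the contrast is roughly the following; a minimal sketch in Python, assuming a hypothetical internal item-service that happens to expose both a per-item endpoint and a bulk endpoint:

    import requests

    ITEM_SERVICE = "https://item-service.internal"  # hypothetical internal endpoint

    def fetch_items_one_by_one(item_ids):
        # The fan-out pattern criticized above: one network round trip per item.
        return [requests.get(f"{ITEM_SERVICE}/items/{i}", timeout=2).json()
                for i in item_ids]

    def fetch_items_bulk(item_ids):
        # If the owning team exposes a bulk endpoint, the fan-out collapses into a
        # single request; getting that endpoint to exist is the political part.
        resp = requests.get(
            f"{ITEM_SERVICE}/items",
            params={"ids": ",".join(str(i) for i in item_ids)},
            timeout=2,
        )
        return resp.json()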
We see this pattern at PagerDuty over the majority of our customers. There is a definite lull in alert volume over the weekends that picks up first thing Monday morning.
It's led to my personal conclusion that most production issues are caused by people, not errant hardware or systems.
After working at various enterprises over the years (where deployments are slower in some cases), I've noticed you'd do a Thursday/Friday/weekend deployment, everything "looks good", and you'll still have a bunch of issues Monday morning when users finally use the system en masse.
Pretty sure the user volume for Uber is higher over the weekends than during the week. Particularly system stressful times are 1:30-2:15am PT on Saturday and Sunday mornings.
While a lot of people go out on Friday and Saturday night, I don't think the total volume surpasses commute volume. The surge pricing you experience late at night is largely due to the smaller number of drivers on the road at that time of day.
Edit (a consideration I originally left out): traffic in your area might be a good proxy for Uber volume. Traffic late at night is generally low.
Your statement is true in San Francisco, but the Pacific time zone also encompasses Oakland, the South Bay, Los Angeles, and San Diego, where relatively few people commute using Uber compared to going out using Uber.
From personal experience, having driven for Lyft and Uber in two cities (SF, SD): surge in SF does get acutely high in the morning, more broadly in the evening (work departure times are spread out over a wide range of hours), and acutely high around 2am. In California, bar closures statewide are 2am at the latest, so that accounts for unified departure times in multiple markets.
Absolute numbers of alerts are probably a lot less useful than alerts per use, or alerts per user. After all, if people use the services PagerDuty covers less often on the weekend, then you might see a lower alert volume even if issues are relatively more common.
I'm bad at optimism, but maybe the takeaway here is that we're finally at the point where it's not the hardware or systems that are the biggest problem. The automation is actually working well enough that it's not the tall tent pole.
I can see the argument that if releasing causes things to break then don't release so frequently, but in practice the end result of that is lots of things breaking at once and having to unpick everything. Debugging is much easier if you're debugging a single change fresh in your mind.
You are right. Making small changes that can be isolated for debugging purposes is a good approach. What I mean is that we should always question "best" practices and how we apply them to our development process. These days there is a tendency to drink the kool-aid (guilty of this as well). DevOps is something that we are still learning and developing as a profession.
Yes, but now we actually try to automate in a systematic way, as opposed to a single admin hacking some custom shell scripts. That's what the "dev" in DevOps means. And that's new; at least it was not a common activity in the 20th century.
If by "systematic way" you mean a group of programmers clueless about system administration hacking together some custom Ansible scripts, then I wouldn't call it progress.
Sysadmins have had tools to automate their work for a long, long time (cfengine, bcfg2, even Puppet and Chef predate the DevOps hype). DevOps didn't bring anything new to the table.
It's not about doing it right or wrong. It's about the automation that gets reused across projects, and the fact that a lot of things that used to require a sysadmin have now had that job automated away. Whether developers implement it correctly or poorly is a different issue.
It is also not about hype (or not). I do agree that the name came after the practice began, but it is the name that we have.
Being a sysadmin was always about automation. DevOps brought nothing to the table about that. Neither tools nor paradigm. And the name is just another one for "administering systems", if we keep what you seem to mean by DevOps.
It's still not progress: hacky scripts written by sysadmins, or similarly hacky scripts written by programmers. A systematic way of automating tasks was available to sysadmins and was used by them for a long time. DevOps brought nothing new to the table.
I've always had the impression that a sysadmin (a position I deeply respect) was more comprehensive than DevOps. DevOps mostly focuses on automating the infrastructure of software development. A sysadmin does that and more.
> It seems like so many "best practices" are really thinly veiled attempts at exploding complexity
I'll bite. :D
It's a bit more nuanced than that IMO; the "deploy often" mantra is only as good as the process around release + deploy. If you half-ass testing and push to production without a process for verification -- or if, say, your deploy process is half-baked, or your staging environment is worthless -- you can probably expect "interesting" production deploys on a pretty regular basis.
As much as we'd like to pretend we're all good engineers, this happens more often than you'd expect -- even with good engineers: at some point a company transitions from scrappy startup to a shambling beast, and the things that used to work for a scrappy startup (like skimping on testing and dealing with failures in production) are insufficient when you've got more eyes on the product. Further, the engineering culture remains stuck in "scrappy startup" mode long after the shambling has begun.
And all that's ignoring the fact that less frequent deploys with more changes have their own set of problems. We actually got to a point with deploys of a certain distributed system such that we were terrified if we had more than a few days worth of changes queued up. So many things that could go wrong! :)
> We create ourselves so many of the problems we are paid to solve.
This, on the other hand, I completely 100% agree with: if not us, then who? :)
The breakage rate per new feature is fairly constant; whether you release 7 new features once a week or 1 new feature once a day, you will have the same number of issues. The question then is: is it easier to deal with all the issues at once, or with a smaller number of issues every day?
In addition to that, I would also worry about interactions between issues. I tend to lean towards spreading out the issues over time to make the eventual diagnosis easier.
Occasionally we'll have a problem where we cannot deploy to production for several days (normally it's once a day). A massive inventory of ready-to-deploy features builds up. When we do finally deploy, this deluge of features and fixes creates new issues that force a rollback, delaying the deploy even longer...
Funny, I've taken it as justification for releasing often. If I have a hard time changing one thing without breaking the service, it's nearly impossible to change a hundred things without breaking something. Since it's a given that I will have to change things, I'll try to stick to a scope where I stand a chance of doing so successfully.
That conclusion is well founded. We correlated issues at Blekko across a lot of different factors, the one that always held was code or configuration changes. Not too surprising in the large but definitely confirmed by the data.
You laugh, but a lot of time the best thing for a software product is to not make so many damn changes. People tend to over-estimate the value of features and tweaks and under-estimate the risk of things going wrong.
Oh I know. I've been in a meeting with several bosses where I tell them flat out "don't give me more developers, give me more hardware and tools". They usually look like I have an extra arm growing out of the top of my head.
That does save a lot of money :-) The trick is managing the rate of change and the risk of disruption. If you manage it to no risk you end up changing too slowly, if you manage it to close to the risk you end up with unexpected downtime and other customer impacting events. Understanding where you are between no risk and certain doom really only comes with experience.
Hospital mortality rates in each department are also lowest when there's a conference for that specialty nearby. The doctors all go there, few/no routine surgeries are done that day, far fewer people die.
That study gave no causal reasons. In particular, there was no effect on number of surgeries done. It's just as possible that research-oriented doctors go to the conference, and those doctors are worse at performing surgeries.
Isn't that because a surgery, if it fails, will result in the patient dying immediately, rather than some random time later? If you do more surgeries, you will increase both the failed and the successful surgeries, whether your success rate is 10 or 90 percent.
Compare: "bomb explosions tend to increase around the time when we send in the bomb squad. I guess bomb defusing is pointless."
IOW: Surgeries and bomb defusing tend to move forward a lot of the death-probability-mass (while, in theory, destroying some of it).
Yep. "Move fast and break things" is immature and inappropriate for anything that purports to be reliable. It's fine for toys like social apps, but for serious actual grown-up applications it's a juvenile hindrance.
In an Agile context at least, it means test out ideas and reiterate quickly. Breaking does not necessarily mean breaking existing things in a visible way.
I see how one could interpret it that way. I meant to imply rather that software generally needs to be regularly changed in order to continue providing value to its paying customers.
At minimum, you've got keeping up with security patches, library/framework deprecations etc. Software which has literally not been changed in years is often an insecure timebomb, with unpatched vulnerabilities, and as soon as you do need to change it, you're stuck because all its dependencies are deprecated. Besides, you also need to keep up with the market, competitors, customers' usage patterns changing, etc.
Ok, that's a reasonable position, though not one I completely agree with.
It would be nice if we as customers had the freedom to pick which upgrades to install, and to limit ourselves to security patches and (maybe, some) library deprecations. Windows updates let you do just that, even if it is a pain to manage. In my previous job, we held monthly meetings to discuss which Windows patches to install on our dozens of business-critical Win2k3 servers.
For most software, upgrades are opaque. In the best case, you are given a binary choice: install now or install later. In many cases you don't even get that; either the product stops working until you apply the patch, or the patch gets automatically applied without your explicit consent.
Which would itself not be a problem if most upgrades did what they are meant to do. Instead, as the GP suggests, patches introduce new (sometimes critical) bugs all the time. With pure repair patches, the balance is mostly positive, since they fix more bugs than they break; but with every new feature (which usually is required by a minority of very vocal customers) comes a risk that some other, more important feature is going to stop working as expected.
Industry-wide, we lack the discipline to change things in a responsible way. Change management is today what source version control was back in the 90s: something most people have heard good things about, but most are not doing correctly, and a significant minority not at all.
It's a bad analogy because they don't upgrade ships while they are in use. People aren't saying software is most stable when it's not used, they are saying it's most stable when people stop changing it.
I think it's OK to move fast and break things when you're in the initial product development stage, where you need to develop features quickly and try lots of new stuff to see what sticks. Once you have paying customers though, it may be time to slow down and start caring about stability, at least for your main branch.
It's also ok to "move fast and break things" when the most critical thing you're breaking is your annoying high school "friend"'s ability to share the latest racist meme with you. There's a reason Facebook made this phrase popular.
I hear what you're saying, in theory. In practice, a system is always changing – regardless of whether developers are deploying code changes.
Some events (or "changes") I've seen in last year that caused a system to stop running fine:
- analyst ran query that locked reporting DB, which had ripple effects
- integration partner "tweaked" schema of data uploaded to us
- NIC started dropping packets
- growing data made backup software exceed expected window, locking files
- critical security patch released (didn't cause a problem exactly, but did change the system from having no known security holes to having a big one)
On and on. So, yeah, I'm not disputing the idea that changes are often the source of problems. I'm just saying that any moderately complex system is constantly encountering new events whether or not the developers are making changes.
This isn't a rebuttal of your statement so much, as it is rebuttal of a common attitude I see in business folks. A lot of non-technical executives seem to have the mentality that software is "done" after it's built, which is naive IMO.
Active systems require active maintenance. You can avoid a lot of problems with intelligent architecture and robust instrumentation/monitoring, but at the end of the day systems rot and will eventually stop running fine.
And even if your perfectly planned and architected system runs in total isolation on a private server where security isn't a major concern, you'll still build up technical debt if you aren't routinely upgrading major libraries, etc. You'll get to a point where you want to implement some feature, and while there are 3 excellent OSS libraries for your platform to accomplish it, none of them is compatible with your 5 year old version.
You probably know all of this and might even agree with most of it. I just had a visceral reaction to the idea that "a system running just fine will usually continue to run just fine," and felt compelled to respond. :)
Note: most of my experience is with public or enterprise web applications. I imagine other types of systems have other problems.
Or until the power is cycled. That's why it's best to do the Chaos Monkey or whatever it's called: walk into the offline production datacenter (you do have geographically redundant datacenters, right?), bring it online for testing, and pull out a half dozen random NIC cables. You can simulate this with software tools, but nothing beats the real test.
I think they meant that Netflix Chaos Monkey is a "simulation" of the real-deal, which is sending a shaved-ape into a physical server room and pulling out physical network cables. I didn't read it as derogatory.
Does anyone who has worked in software for any period of time not realize this? Computers are dumb, but they (generally) repeat the same tasks reliably _compared to humans_. Occasionally there are system failures, but most failures are caused by code or configuration changes, especially when they interact with inputs and systems in untested ways.
Just to confirm: 1,000 microservices in this case means 1,000 different apps (e.g. different Docker images) running simultaneously, not 1,000 microservice instances (e.g. Docker containers)?
If it is 1,000 microservices as in different apps, then they must have at least 2,000 running instances (at least 2 instances per app for HA).
Maybe Uber only has 200 "active" microservice apps running at the same time, where each microservice has N running instances.
I just can't imagine running 1,000 different microservices (i.e. different Docker images, not Docker instances) at the same time.
Our number is closer to 1,700 now, but yes this means 1,700 distinct applications. Each application has many instances, some have thousands of instances.
Could you give some information as to what the breakdown of functionality is for those services? I can't fathom 1700 different and unique pieces of functionality that would need to be their own services.
I work at a microservice company. An example microservice is our geoservice which simply takes a lat/lon and tells you what service region the user is in (e.g. New York, San Francisco, etc..). You can see how dozens of these services might be needed when handling a single request coming in from the front end or mobile apps. The service may eventually gain another related function or two as we work on tearing down our monolith (internally referred to as the tumor because we do anything to keep it from growing...).
You might want to add functionality such as caching or business rules.
A better question would be: why not write a module or class? There are pros and cons to either, but advantages include: better monitoring and alerting, easier deployments and rollbacks, callers can time out and use fallback logic, you can add load balancing and scale each service separately, it's easy to figure out which service uses what resources, it makes it easy to write some parts of the application in other programming languages, and different teams can work on different parts of the application independently as long as they agree on an API.
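As a rough illustration (not any company's actual code), a single-purpose geoservice like the one described above could be as small as this Flask sketch, with made-up region bounding boxes:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Hypothetical bounding boxes: region -> (min_lat, min_lon, max_lat, max_lon)
    REGIONS = {
        "san_francisco": (37.6, -122.6, 37.9, -122.3),
        "new_york": (40.5, -74.3, 40.9, -73.7),
    }

    @app.route("/region")
    def region():
        lat = request.args.get("lat", type=float)
        lon = request.args.get("lon", type=float)
        if lat is None or lon is None:
            return jsonify({"error": "lat and lon are required"}), 400
        for name, (lat0, lon0, lat1, lon1) in REGIONS.items():
            if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
                return jsonify({"region": name})
        return jsonify({"region": None}), 404

    if __name__ == "__main__":
        app.run(port=8080)

The code itself is trivial; everything around it (deployment, monitoring, client libraries, on-call) is the real cost of making it a separate process rather than a function.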
Strictly this is not necessarily relevant. You can easily roll that table hit into another db hit you were already making. Can't do that with services.
Yes, but instead of making it a whole new service, you are probably already using a database and can use that service for this functionality as well.
But since asking the question, I've realized that if your application already needs a huge number of servers because it simply gets that much traffic, then putting something like this in its own Docker instance is probably the simplest way (it might even use Postgres inside it), especially if those boundaries change now and then.
I understand what microservices are, but I can't understand what 1700 pieces of unique functionality of Uber could be abstracted into their own services. I am struggling to even think of 100, so I was curious what exactly some of these things were, and how they structured things to need so many service dependencies.
It seems that for functions as simple as that, the RPC overhead had better be pretty small, or it will eclipse the time and resources spent by the actual business logic.
E.g. I can't see a REST service doing this; something like a direct socket connection (behind an HA / load-balancing layer) with a zero-copy serialization format like Cap'n Proto might work.
Whoa. That is an insanely large number of applications. I'm assuming that's essentially one function per microservice, which is one of the huge do-not-dos of microservices, as it's just a complete waste of time, resources, etc.
I would love to hear a breakdown. This sounds like a nightmare to maintain and test.
>>> Our number is closer to 1,700 now, but yes this means 1,700 distinct applications. Each application has many instances, some have thousands of instances.
Time to forbid adding more stuff and start cleaning.
(12) [...] perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.
Why? I'm assuming their highish engineer/service ratio is because their services do less individually.
Anecdotally, I've worked on services that ran tens of thousands of instances across the world. You build the tools to manage them and it works very well.
People sometimes talk about interpreted languages being slow; now your program is divided over 1,700 separate servers, and instead of keeping variables in memory you have to serialize them and send them over the network all the time.
Maybe it's my ignorance, but do they have something like left-pad-as-a-microservice? I can't understand which 1,000 microservices you can derive from a cab-hailing application.
I used to work in startups, and wondered this all the time. Then I joined a big valley tech company, and now I understand.
It's because they hire smart and ambitious people, but give them a tiny vertical to work on. It's a person level problem, not a technical one in my opinion.
I think you solve this by designing your stack and then hiring meticulously, instead of HIRE ALL THE ENGINEERS and then figure it out (which is quite obviously Uber's problem).
I'm not sure if it's ignorance that leads people to dismiss things they don't understand so much as it is a coping mechanism similar to that which gave birth to various religions. Ignorance just lets it thrive after the birth.
Phrasing aside, it's a legitimate question. I've worked at a really, really large web behemoth serving orders of magnitude more users and many unique and disparate products (as opposed to the handful of products that Uber serves). If I counted all of their production-related services, I'm not sure they'd amount to anywhere close to 1,700. Now I know "microservices" is the new hotness, but surely there are limits to human cognition that cap the total number of moving pieces any team can manage without maintaining a constant state of crisis.
It'll be interesting to know the number of people oncall at any given time and the number of prod alerts per hour/day/week.
I have heard of, but never witnessed, groups where the team stays together but cycles through projects a sprint at a time. One or two teams keeping three or four projects spinning but making the projects take turns.
I don't know if they provide popcorn at the meetings where the project managers explain why they deserve the next sprint.
Nonsense. libc? A microservice for every function call! Surely if you can wrap your head around a library you can wrap it around a thousand little microservices.
I think these "alarming trends" are highlighting that operational complexity is an easier pill to take than conceptual complexity, for most workers in the field.
Microservices address that gap.
And in the process the field is transformed from one of software developers to software operators. This generation is witnessing the transfer of the IT crew from the "back office" to the "boiler room".
Do you think it's that people would rather deal with operational complexity, or just that it's easier to not think about operational complexity early on ("running locally, with low volume, everything works together great!")?
Personally, I vastly prefer debugging and building on non-microservice architectures than on something split willy-nilly into dozens of services (because most implementations I've seen don't do microservices with clean conceptual boundaries - it's more political/organizational divisions that determine what lands where, not architectural concerns).
>>> Do you think it's that people would rather deal with operational complexity, or just that it's easier to not think about operational complexity early on
That's one way to scale development projects.
Have multiple teams of devs who work on separate stuff. They can develop in their corner however they want, and that makes them happy. The final thing is a clusterfuck of services with little cohesion [YET they ALWAYS manage to put the shit together in production in a mostly working state].
The alternative implies having a consistent and cohesive view of the components. For that, you'll need people with experience in architecting scalable and sane production systems [so they understand the consequences and tradeoffs of ALL their decisions]. I know of very few people on the planet who can design systems. (We're talking past unicorn-rarity here.) Plus, the developers must actually listen AND care about the long-term maintenance AND the people involved (i.e. not want to do shit that will hit either them in 6 months or the team in the next office).
The amount of collaboration AND communication AND skill required to operate this strategy is beyond the reach of most people and organisations. There are very few individuals who can execute at this level.
The inherent greater complexity of designing extensible and cohesive systems turns away a large subset right away and it never gets to a comparative matrix to inform the decision making process. It's been reduced to "monoliths bad, microservices good" level of thinking.
I honestly think that the rise of microservices as "the technical cure for cancer and everything else" is a very interesting surfacing of the various systemic dysfunctions of our beloved software industry. The dysfunctions have been present from day 1, but the environment has somewhat radically changed.
You seem to be missing the point though. Most developers working on microservices projects don't have to deal with the operational complexity. They just need to work on their microservice and don't have to know what is going on elsewhere.
Personally I much prefer microservices for debugging. You can quickly identify which one is the problem then test the APIs in isolation pretty quickly. Sure beats having to wait 20 minutes for a monolithic app to build.
So what do these people do when they get data back from one of their dependency services, and it looks wrong? Or a bug is reported that somewhere in the chain, something is being dropped on the floor? You say you can quickly identify which one is the problem, but if that's a chain that spans 5 teams, how does that actually work in practice? (My experience is that it doesn't.)
That's the sort of thing I was including in operational complexity, not just the "are the packets making it through" stuff.
Exactly. Testing an API in isolation doesn't help you one bit when the bug arises because of, say, subtle inconsistencies between API implementations of interacting services on a long chain.
This sounds a lot like the code coverage fallacy. (to which I usually answer "call me when you have 100% data coverage").
I'm not a big green-IT guy, but always pushing your systems close to their load maxima and then backing off the test traffic as real traffic comes in feels like an enormous waste of electricity.
Absolutely amazing to watch. I think most of the big companies (Amazon, Google) already have solutions for these issues like: limited number of languages, SLAs between services, detailed performance metrics and the ability to trace microservices.
Honestly I don't think the problem is microservices. I mean everything he brings up is true, but it's more of a "how you do microservices" issue.
I used to work in startups, and overall was impressed with velocity. Then I joined a big valley tech company, and now I understand.
It's because they hire smart and ambitious people, but give them a tiny vertical to work on. On a personal level, you WANT to build something, so you force it.
I think you solve this by designing your stack and then hiring meticulously with rules (like templates for each micro service), instead of HIRE ALL THE ENGINEERS and then figure it out (which is quite obviously uber's problem)
I quite enjoyed watching this. My takeaway isn't so much that this is a critique or endorsement of microservices, but rather just a series of observations. Not even lessons learned in many cases, just honest pragmatic observations. I like how he doesn't seem to judge very much – he obviously has his own opinion of several of these topics, but seem to let that more or less slide in order to not get in the way of these observations.
I found this video so super interesting and yet frustrating for completely personal reasons: the company I work for used to sell a tracing product that was specifically designed for the distributed tracing problem and handled all of the issues he highlighted - trace sampling, cross-language/framework support built in, etc. It was/is based on the same tech as Zipkin but is production ready. Sadly, he and his team must have spent a huge amount of time rolling their own rather than ever learning about our product. Now, it still might not have been a good match, but man, the problems he mentions were right in the sweet spot of what our product did really, really well.
This is why sales and marketing is so so important and yet overlooked/undervalued. You might have a great product that's perfect for your customer, but you also have to convince them to use it and pay for it.
Yes, I think that is the primary problem - they invested in Zipkin. At the end of his talk, he mentions the idea that you should prioritize core functions and use vendors for everything else. But he's right that some person created that code - and thus it becomes difficult to get them to switch. And instead it becomes a huge source of technical debt for a company that is non-core to their mission.
What I Wish Small Startups Had Known Before Implementing A Microservices Architecture:
Know your data. Are you serving ~1000 requests per second peak and have room to grow? You're not going to gain much efficiency by introducing engineering complexity, latency, and failure modes.
Best case scenario and your business performs better than expected... does that mean you have a theoretical upper bound of 100k rps? Still not going to gain much.
There are so many well-known strategies for coping with scale that I think the main take-away here for non-Uber companies is to start up-front with some performance characteristics to design for. Set the upper bound on your response times to X ms, over-fill data in order to keep the bound on API queries to 1-2 requests, etc.
Know your data and the program will reveal itself is the rule of thumb I use.
The main benefit of microservices is not performance, it is decoupling concerns. As scale goes up, performance goes down and decoupling gives a better marginal return: it's often easier to work with abstract networking concerns across a system than it is to tease apart often-implicit dependencies in a monolithic deployment.
Of course, this is context sensitive and not everyone has easily decouplable code bases, so YMMV. But not recognizing the myriad purposes isn't useful for criticism.
I think it's worth noting that it doesn't just split the code. It also splits the teams. 5 engineers will coordinate automagically; they all know what everyone is working on.
At perhaps 15, breaking up into groups of 5 (or whatever) lets small groups build their features without stomping on anyone else's work. It cuts, not only coupling in the code, but coupling in what engineers are talking about.
There are two sort-of-obvious risks. One is that those teams probably should be reshuffled from time to time, so code stays consistent across the organization. If that's done with care, specific projects can get just the right people. If it's done poorly, you sort of wander aimlessly from meaningless project to meaningless project.
The other one is overall performance and reliability. When something fans out to 40 different microservices, tracking down the slow component is a real pain. Figuring out who is best to fix that can be even worse.
> a better marginal return: it's often easier to work with abstract networking concerns across a system than it is to tease apart often implicit dependencies in a monolithic deployment.
Do the complexities introduced not have a cost, i.e. are those marginal returns offset by the choice in the first place? Call it "platform debt."
Oh they definitely do. It's just a different flavor and the debt scales differently. Ideally the debt amortizes across the services so you can solve a problem across multiple places. This is obviously really difficult to discuss or analyze without a specific scenario.
It's probably not worth discussing unless you're actively feeling either perf or coupling debt pressure.
I have no idea why people keep thinking microservices is all about scalability. Almost like they've never worked on a problem with them before.
Microservices is all about taking a big problem and breaking it down into smaller components, defining the contracts between the components (which an API is), testing the components in isolation and most importantly deploying and running the components independently.
It's the fact that you can make a change to say the ShippingService and provided that you maintain the same API contract with the rest of your app you can deploy it as frequently as you wish knowing full well that you won't break anything else.
It also aligns better with the trend towards smaller container based deployment methods.
You don't need to build a distributed system for that. Just build the ShippingServiceLibrary, let others import the jar file (or whatever) and maintain a stable interface.
The point is that unless you use JVM hot reloading (not recommended in production), you will need to take down your whole app to upgrade that JAR. Now what if your microservice was something minor, like a blacklisted-word filter? Is adding a swear word worth a potential outage?
If you run multiple instances of your server and follow certain reasonable practices, you can take down old-version instances one by one and replace them with new-version instances, until you have upgraded the entire fleet.
Alternatively, you can deploy the new version to many machines, test it, then make your load balancer direct the new traffic to the new instances, until all old instances are idle, and then take down these.
You certainly can do that. A little discussed benefit (or curse) of a microservice is maintaining an API (as opposed to ABI). Changes are slower and people put more thought into the interface as a discrete concern. I am curious if an ideal team would work better with a lib--I think so, but I'm not sure!
I think people don't maintain discipline if it isn't forced. So it's a roundabout way of forcing people to maintain boundaries that wouldn't be needed much of the time if people had stricter development practices.
> I have no idea why people keep thinking microservices is all about scalability.
It's an aspect. It's often the beginning of a micro-service migration story in talks I've heard.
> Microservices is all about taking a big problem and breaking it down into smaller components
... and putting the network between them. It's all well and good, but the tradeoffs are not obvious there either. Most engineers I know who claim to be experts in distributed systems don't even know how to formally model their systems. This is manageable at a certain small scale, but some failure modes take 35 or more steps to reveal themselves. Even the AWS team has realized that this architecture comes with the added burden of complex failure modes[0]. Obscenely complex failure modes that aren't detectable without formal models.
Even the presenter mentioned... why even HTTP? Why not a typed, serialized format on a robust message bus? Even then... race conditions in a highly distributed environment are terrible beasts.
You seem to be repeating these weird myths that have no basis in reality.
You don't need to be an expert in distributed systems to use microservices. It's literally replacing a function call with an RPC call. That's it. If you want to make tracing easier, you tag the user request with an ID and pass it through your API calls, or use something like Zipkin. But needing formal verification in order to test your architecture? Bizarre. And I've worked on 2 of the world's top 5 ecommerce sites, which both use microservices.
And HTTP is used for many reasons, namely that it is easy to inspect over the wire and is fully supported by all firewalls and API gateways, e.g. Apigee. And nothing is stopping you using a typed, serialized format over HTTP. Likewise, nothing is stopping you using a message bus with a microservices architecture.
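The "tag the request with an ID" idea is simple enough to sketch. A hedged example in Python, using a hypothetical downstream service and the common (but not universal) X-Request-Id header:

    import uuid
    import requests

    def handle_request(incoming_headers):
        # Reuse the caller's trace ID if present, otherwise start a new trace.
        trace_id = incoming_headers.get("X-Request-Id", str(uuid.uuid4()))

        # ...do local work, then forward the same ID to downstream services
        # so logs across services can be joined on one identifier.
        resp = requests.get(
            "https://pricing-service.internal/quote",  # hypothetical downstream call
            headers={"X-Request-Id": trace_id},
            timeout=2,
        )
        return trace_id, resp.status_code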
> You seem to be repeating these weird myths that have no basis in reality.
No basis at all? I knew I was unhinged...
> You don't need to be an expert in distributed systems to use microservices.
True. Hooray for abstractions. You don't need to understand how the V8 engine allocates and garbage collects memory either... well until you do.
> It's literally replacing a function call with an RPC call.
You're not wrong.
Which is the point. Whether for architectural or performance reasons I think you need to understand your domain and model your data first. For domains that map really well to the microservice architecture you're not going to have many problems.
And a formal specification is overkill for many, many scenarios. That doesn't mean they're useless. They're just not useful, perhaps, for e-commerce sites.
But anywhere you have an RPC call that depends on external state, ordering, consensus... the point is that the tradeoffs are not always immediately apparent unless you know your data really well.
> And I've worked on 2 of the world's top 5 ecommerce sites which both use microservices.
And I've worked on public and private clouds! Cool.
The point was and still is the same whether performance or architecture... think about your data! The rest falls out from that.
Yeah, that absolutely didn't compute for me either. If it really were the case that no thought was needed for transactions or correctness, I would have started to split my monolithic apps years ago. It's when your data model is complex enough to make consistency/correctness difficult without transactions that you have real trouble splitting things up.
RPC vs. a plain procedure call introduces a greatly increased probability of random failure. If you're trying to do something transactional, you're in a different world of pain.
> [Scalability i]s an aspect. It's often the beginning of a micro-service migration story in talks I've heard.
It absolutely helps with people and org scalability. I haven't seen it help with technical load scalability (assuming you were already doing monolithic development relatively "right"; we ran a >$1BB company for the overwhelming majority on a single SQL server).
Basically if you're building a hypermedia REST API you return an entity or collection of entities whose identifiers allow you to fetch them from the service like so:
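Something like the following hypothetical shape, shown here as a Python literal:

    # Hypothetical hypermedia-style collection: each entity is referenced by a URL
    # the client can follow if it wants the full representation.
    trips_response = {
        "trips": [
            {"id": "t-101", "href": "https://api.example.com/trips/t-101"},
            {"id": "t-102", "href": "https://api.example.com/trips/t-102"},
        ],
        "next": "https://api.example.com/trips?page=2",
    }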
The client, if interested, can use those URLs to fetch the entities from the collection that it is interested in. This poses a problem for mobile clients where you want to minimize network traffic... so you over-fill your data collection by returning the full entity in the collection.
The trade off is that you have to fetch the data for every entity in the collection, the entities they own, etc; and ship one really large response. The client would receive this giant string of bytes even if the client was only interested in a subset of the properties in the collection.
GraphQL does away with this problem on the client side rather elegantly by allowing the client to query for the specific properties they are interested in. You don't end up shipping more data than is necessary to fulfill a query. Nice!
... but the tradeoff there is that you lose the domain representation in your URLs since there are none.
Great video. I went in expecting it to cover mostly the technical side of things. Instead Matt gave a great overview of the team / engineering organization dynamics to watch out for when adopting microservices. (I particularly liked that he pointed out how developers may write new code / services instead of dealing with team politics.)
Really enjoyed this talk. Our services don't quite (yet :)) run at that scale, but many of the issues mentioned have already surfaced at some point. It's also good to have (more) validation for some of the choices we have made in the past, are currently making, or are thinking about making in the short-term future.
On-call shifts are going to be interesting with an average of 2.5 engineers per service, not to mention handling people switching teams or leaving the company.
Taken in a healthy organization, Conway's Law is not a bad thing. It is really just more of an observation. So... taken to this example, it sounds like the company is a confusing mess of people trying to figure out who they need to coordinate with to make something happen.
True. I was using it as an argument to adjust the system architecture, as well. My argument was it didn't matter really which changed to get things into alignment, but that having them be different was a bit of a concern.
Is that strange? I've got 6 repos at work for various utilities and "personal" projects I'm working on in addition to the 4 repos for team wide projects.
I have over 3 dozen private Git repos for personal amusement or utility projects, for cases where professional consumer software doesn't work as well as I want. Then I have at least 10 repos on GitHub (not this userid). At my last job, a 9-month contract, I had 4, I think, aside from our main team repos, just to manage personal info documents, my bin folder, etc. It's easy for one developer to have several. On my Linodes I keep /etc in its own repo as well, and my VirtualBox and VMware VMs, and inside those I keep several Git repos.
Edit: I remembered more personal projects, make that almost 3 dozen personal Git repos.
My guess (and it is just a guess) is that each of these services is made up of multiple even smaller libraries which are each being version controlled as a distinct entity. I'm not sure I could fully get behind working at that level of granularity, but it would explain how you end up with quite so many git repositories.
From what I understand, they split their microservices into logical packages and libraries, each of which have their own repo. Then some microservices have a separate repo to track configuration changes.
This leads to an interesting perspective, that there is a hard limit on how much traffic your services will ever have to handle, and that limit is probably a constant expression on the planet's population.
But I can't imagine how you could possibly need 8000 git repositories unless you're massively over engineering your problem. Project structure tends to reflect the organizations that build them.
Which isn't necessarily a problem. Just put X shard on the same machine. If the largest city can be handled by a single server, I generally agree with your parent.
I've thought about this problem before, both for a related problem space and with friends working in this specific space. The short version is to create a grid where each cell holds car data (id, status, type, x, y, ...) in memory. Any write-lock of concern is only needed when a car changes cells, and then only on the two cells in question. This can be layered multiple levels deep, and your final car-holding structure could be an r-tree or something.
The grid for a city can be sharded across multiple servers. And if you told me that was necessary, fine... but as-is, I'm suspicious that a pretty basic server can't handle tens of thousands of cars sending updates every second.
Friends tell me the heavy processing is in routing / map stuff, but this is relatively stateless and can be sent off to a pool of workers to handle.
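A bare-bones sketch of that grid idea in Python (a toy, assuming a single writer per car and a made-up cell size), just to show where the per-cell locking lands:

    import threading
    from collections import defaultdict

    CELL_SIZE = 0.01  # degrees; hypothetical grid resolution

    class CarGrid:
        def __init__(self):
            self.cells = defaultdict(dict)            # (gx, gy) -> {car_id: (lat, lon)}
            self.locks = defaultdict(threading.Lock)  # one lock per cell
            self.car_cell = {}                        # car_id -> current cell

        def _cell(self, lat, lon):
            return (int(lat / CELL_SIZE), int(lon / CELL_SIZE))

        def update(self, car_id, lat, lon):
            new_cell = self._cell(lat, lon)
            old_cell = self.car_cell.get(car_id)
            if old_cell is not None and old_cell != new_cell:
                # Only a cross-cell move touches a second cell.
                with self.locks[old_cell]:
                    self.cells[old_cell].pop(car_id, None)
            with self.locks[new_cell]:
                self.cells[new_cell][car_id] = (lat, lon)
            self.car_cell[car_id] = new_cell

        def cars_near(self, lat, lon):
            cell = self._cell(lat, lon)
            with self.locks[cell]:
                return dict(self.cells[cell])

    grid = CarGrid()
    grid.update("car-42", 37.7749, -122.4194)
    print(grid.cars_near(37.775, -122.42))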
I always wonder in these cases about giving each car an actor in Erlang/Elixir and having complete network transparency, message handling and crashes handled for free.
The routing is very complex too but as you note scales well, until you want to start routing/pickups based on the realtime location of other cars.
I've given that kind of model a lot of thought too, but I did not find very satisfying solutions to these problems:
- How do you distribute the car actors on nodes, assuming the number of nodes is variable? (I think riak_core looks interesting, but it does not seem to have a way to guarantee uniqueness; it's rather built to replicate data on multiple nodes for redundancy.) A placement sketch follows this list.
- What happens if a node fails? What mechanism is going to respawn the car actors on a different node? How do you ensure minimal downtime for the cars involved?
- What happens if there's a netsplit, e.g. how do you ensure no split brain where two nodes think they're responsible for a car?
It feels to me like the traditional erlang-process-per-request architecture coupled with a distributed store (riak or w/e) makes it possible to avoid the very difficult problem of ensuring one and only one actor per car in a distributed environment.
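The placement question (the first bullet above) is often answered with consistent hashing, so that only a fraction of cars move when a node joins or leaves. A rough, language-agnostic sketch in Python, which deliberately ignores the harder failover and netsplit questions raised above:

    import bisect
    import hashlib

    class HashRing:
        """Minimal consistent-hash ring mapping car IDs to nodes."""

        def __init__(self, nodes, vnodes=64):
            # Several virtual points per node smooth out the distribution.
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(vnodes)
            )
            self.keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, car_id):
            # Walk clockwise on the ring to the first virtual node at or after the key.
            idx = bisect.bisect(self.keys, self._hash(car_id)) % len(self.keys)
            return self.ring[idx][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("car-12345"))  # the node that should own this car's actor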
For the netsplit stuff I'd look at how Phoenix.Presence works and see how that handles it with a CRDT. There are various ways of distributing things across nodes in Erlang/Elixir, but maybe you'd need to build something.
I think you are right about the riak_core stuff; you could probably keep track of cars using some sort of distributed hash and kill duplicate car processes if they were ever to spawn.
In fact a way of instantiating processes and finding them based on a CRDT is probably a pretty cool little project...
I think that's a cool idea. It would have the downside that error recovery could take a while though, depending on the permdown period, so during that time a driver would be stuck; while in a request-based system they can immediately try again and it would work (hit a different instance). But it might not be too bad.
Care to explain which bits Erlang/Elixir doesn't do, or at least doesn't provide a solution for? As the sibling comments show, you may need to be smart about certain things, but OTP definitely provides ways of solving the above.
When the Facebook IPO crashed the NASDAQ, I suspect it's because NASDAQ was sharding by ticker symbol. That was rational until one ticker became 1/3 of the trading volume.
No, that had absolutely nothing to do with it. It was based on a bad assumption about volume of trade entries/cancellations during the initial price calculation.
That doesn't contradict my conjecture that the issue was improper sharding. The jump in duration of cancellation detection from 2ms to 20ms could have been because they were running that calculation on a single machine.
Although... Ethernet latency would probably make it tough to stick to 2ms.
That's like a post-office way of thinking about the internet; we're already past that point. Plus you haven't solved the problem of managing 800 microservices and 2,000 programmers, and now you have a deployment problem as well.
Nah, you're making a mistake with the zeroes, you should totally stick to all ones instead, everyone knows ones are better than zeroes (they're one more man! are you blind?!)
Not really what the video addresses for the most part. Mostly talking about their development process and impact of breaking up their application into 1000s of microservices. This is more to handle their development team scaling, rate of change etc...
I think that only parts can be sharded by city. Anything having to do with users has to be global (because I can visit different cities). I suppose you could try to "move" users from shard to shard but you'd still want a globally unique ID for each user (so you could run cross-city analytics). It feels like you'd be building so much orchestration around making the global parts work in sharded manner that you might as well make the whole thing work globally.
To me, it seems the best thing to do in that scenario is taking the huge "shard by city" gain when it comes to rides. Then dealing with the global nature of users by having global user DBs sharded by user ID.
Everything related to driver selection and real time tracking is very specific to one location. The server responsible for tracking drivers in San Jose doesn't need to reference rider locations in NYC under any circumstance.
The things you highlighted are very light load comparatively. Sign in, load account info, then dispatch to location specific shard.
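A toy sketch of the split being described here (hypothetical shard names), keeping the light global user lookup separate from the heavy city-local dispatch:

    # Hypothetical shard layout: users sharded globally by ID, rides sharded by city.
    USER_SHARDS = ["users-db-0", "users-db-1", "users-db-2", "users-db-3"]
    CITY_SHARDS = {"san_francisco": "rides-sf", "new_york": "rides-nyc"}

    def user_shard(user_id: int) -> str:
        # Light, global lookup: account info, sign-in, payment profile.
        return USER_SHARDS[user_id % len(USER_SHARDS)]

    def dispatch_ride(user_id: int, city: str) -> str:
        account_db = user_shard(user_id)
        rides_db = CITY_SHARDS[city]   # heavy, city-local: driver matching, tracking
        return f"account via {account_db}, dispatch via {rides_db}"

    print(dispatch_ride(12345, "san_francisco"))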
I think the world of service architecture is roughly divided in two camps: (1) people who still naively think that Rest/JSON is cool and schemas and databases should be flexible and "NoSQL" is nice and (2) people who (having gone through pains of (1)) realized that strong schemas, things like Thrift, Protobufs, Avro are a good thing, as is SQL and relational databases, because rigid is good. (Camp 1 is more likely to be using high level dynamic languages like Python and Ruby, and camp 2 will be more on the strongly typed side e.g. C/C++, Go, Java).
> people who still naively think that Rest/JSON is cool and schemas and databases should be flexible and "NoSQL" is nice and
Yes, and no.
Yes: REST / JSON is nice. I've used them widely as a kind of cross-platform compatibility layer. i.e. instead of exposing SQL or something similar, the API is all REST / JSON. That lets everyone use tools they're familiar with, without learning about implementation details.
The REST / JSON system ends up being a thin shim layer over the underlying database. Which is usually SQL.
No: databases should NOT be flexible, and "NoSQL" has a very limited place.
SQL databases should be conservative in what they accept. Once you've inserted crap into the DB, it's hard to fix it.
"NoSQL" solutions are great for situations where you don't care about the data. Using NoSQL as a fast cache means you (mostly) have disk persistence when you need to reboot the server or application. If the data gets lost, you don't care, it's just a cache.
> SQL databases should be conservative in what they accept. Once you've inserted crap into the DB, it's hard to fix it.
You can make your schema very light and accepting, almost like NoSQL, which is how you get into the situation you described; the solution is to use a stricter schema. That, and it helps to hire a full-time data engineer/administrator.
> Using NoSQL as a fast cache
I'd rather use caching technology, specifically designed for caching, like Redis or Varnish or Squid.
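For the cache use case specifically, the usual cache-aside pattern is tiny; a sketch with the redis-py client and a made-up lookup function:

    import json
    import redis  # assumes the redis-py client is available

    r = redis.Redis(host="localhost", port=6379)

    def expensive_region_lookup(lat, lon):
        # Stand-in for a slow database or geo query.
        return {"region": "san_francisco"}

    def region_for(lat, lon):
        key = f"region:{round(lat, 3)}:{round(lon, 3)}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)          # cache hit: no backing query
        result = expensive_region_lookup(lat, lon)
        r.setex(key, 300, json.dumps(result))  # cache miss: store with a 5-minute TTL
        return result

If the cache is wiped, nothing is lost but latency, which is exactly the "don't care about the data" property described above.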
At one time I implemented services that honoured the robustness principle - "be liberal in what you accept, and conservative in what you send". However, I have found that if you are liberal in what you accept, other services can end up dependent on quirks of your liberal acceptance. More recently I have become a believer in the "be conservative in both" strategy. Rigid is good.
Camps 1 and 2 are not mutually exclusive (well, except your inflammatory "naively" comment).
Rest/JSON is a well understood, broadly adopted, low friction RPC format.
NoSQL is not always MongoDB (for example, Google Datastore is ACID compliant), and schema enforcement via an ORM layer I would argue is actually a good thing, as it provides schema validation at compile time.
The longer a database has existed, the more likely somebody in the company wrote something crucial that accesses it without your knowledge and without going through your ORM (usually because your ORM isn't implemented for the language they're using, or it emits bad queries for their use case). Sanity checks that aren't enforced by the database can't be relied on to be up to date or even happen at all.
REST/JSON is not RPC; technologies like Thrift can be used for RPC. It depends on how you define low friction for REST + JSON. JSON is schemaless, and that can be great for prototyping, but as soon as deployed services get out of sync in terms of how they communicate, it becomes more of a burden than an advantage. Thrift, Protobuf, and Avro can enforce schemas and can raise exceptions on communication mismatches, so less defensive programming is needed checking JSON responses. For internal service communication, I really think using a schema-enforcing communication protocol is a good thing.
In the absence of a good reason not to use Python (and there are many cases where one should not), I use it quite a bit myself. And usually (absent a good reason) I use it with PostgreSQL. SQLAlchemy is a wonderful thing.
Granted, I am neither a database nor an ORM savant, but I find that it makes explicit almost as easy as implicit - but with more safety! I haven't seen that elsewhere, but I haven't looked very hard either. I have heard claims that Groovy/Hibernate does this just as well, but it isn't clear to me that this is completely true.
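In that spirit, a minimal sketch of the kind of explicit schema SQLAlchemy encourages (a hypothetical rides table, SQLAlchemy 1.4+ style):

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Ride(Base):
        __tablename__ = "rides"
        id = Column(Integer, primary_key=True)
        rider_id = Column(Integer, nullable=False)     # constraints live in the schema,
        city = Column(String(64), nullable=False)      # not scattered across callers
        fare_cents = Column(Integer, nullable=False)

    engine = create_engine("sqlite:///:memory:")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(Ride(rider_id=1, city="san_francisco", fare_cents=1250))
        session.commit()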
Conceptually, everything is a Schema and then you generate Changesets for DB operations like insert, update, etc. and apply various validations and transformations as a chain of function calls on the input map. No such thing as a model anymore. It fits really nicely with a data > functions mindset.
Is your first example really "naive" though? In my experience, loose, flexible schemas and dynamic languages are very well suited to rapid early-stage development, much more so than rigid languages and schemas.
Sure, in the long term things should be refactored, structured, and optimized. But if you do that too soon you risk locking yourself out of potential value, as well as gold-plating things that aren't critical.
> Sure, in the long term things should be refactored, structured, and optimized.
How often does that really happen though? Once you've amassed enough technical/data debt, resistance to refactoring increases until it never happens at all. Having well-defined, coherent data models and schemas from the start will pay off in the long run. Applications begin and end with data, so why half-ass this from the get-go?
Assuming you're not omniscient you'll be refactoring regardless. The difference is whether you'll be paying as you go (clients want a JSON API, we need to add new columns to a table but it'll lock rows) or if you'll be taking on technical debt to be repaid in the future (turns out Mongo sucks and we would do much better with Cassandra for serious horizontal scaling).
I believe that if you aren't extremely certain about what the future holds, it may be best to work with a more flexible technology first and transition to a more structured setup once you have solved your problems and identified intended future features. And if you are extremely certain about what the future holds, you're either insanely good at your job or just insane.
I think the key is to do regular refactoring as you go -- it has to be an ingrained part of the process. It's really a management issue. Not every company/project has the foresight to budget for this, of course. If a team can't or won't regularly improve their infrastructure, then yes, a more structured approach would probably be better for anything that needs to last.
Of course there are other considerations. A more "planned" structure always makes sense if you're talking about systems or components that are life-critical or that deal with large flows of money. The "fast and loose" approach makes the most sense when you can tolerate occasional failures, but you have to have fast iterations to be quick-to-market with new features.
In my experience the likelihood of your scenario (increasing resistance to refactoring) is almost always inversely proportional to the amount and quality of refactoring tools available.
500KLOC JVM/.NET application? No big deal.
50KLOC JS/HTML-based SPA? Pfooooh. That could take a while...do we really need to?
This is the key point I think. Flexibility is very important when a project is immature. And as it matures it benefits from additional safety. Too bad so few tools support that transition.
Camp 3: those of us who think "shove something, anything that 'works' out as fast as possible" is a recipe for disaster, irrespective of the transport format or service infrastructure.
I suppose folks in this camp would overlap significantly with those in camp 2, though.
REST/JSON makes sense between a human-driven browser and a middle tier; between machines, SOAP is the best contract-first technology, or at least a custom XSD-backed XML schema.
I don't understand the conflation of Rest/JSON with NoSQL. The most popular frameworks for the dynamic languages you cite all use SQL with strong schemas out of the box. Come to think of it, I'm not sure why we're even associating Rest with JSON specifically, or with any content type for that matter, when one of the major points of Rest is content negotiation via media types. You can even implement a messaging concept with Protobufs in a strictly "rest-like" fashion.
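On the content-negotiation point: the same resource can serve JSON to a browser and a binary encoding to internal callers, selected by the Accept header. A hedged sketch with the requests library (the URL is hypothetical, and the exact protobuf media type varies by setup):

    import requests

    URL = "https://api.example.com/rides/42"   # hypothetical endpoint

    # Same resource, two representations, chosen via the Accept header.
    as_json = requests.get(URL, headers={"Accept": "application/json"})
    as_pb = requests.get(URL, headers={"Accept": "application/x-protobuf"})

    print(as_json.headers.get("Content-Type"))
    ride = as_json.json()     # a dict, if the server returned JSON
    raw = as_pb.content       # bytes to feed into a generated protobuf parser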
You're missing the point where you have a flexible, schemaless design while you're speccing out the service and seeing whether it's even necessary, viable, and likely to be needed long term as part of a long-lived system.
Properly encoding rigidity through type systems, SQL checks/triggers/constraints, etc. is hard. It takes a really long time to iron all the string and integer/double/long typing issues out of your system, let alone to do it in a way that matches up properly with your backing datastore. Once you've got it set up and nailed down with tests, you're golden, but that's a cost that is usually not worth paying until a long-term need is determined.
(3) People who ponder what the best solution is for a given situation. Sometimes you want dynamic languages or schema-less data; sometimes you want rigid interfaces and verifiability.
E.g. rolling out a public API in Thrift/protobuf will severely hamper its adoption, whereas Rest/JSON is pretty much the standard - but then building microservices that communicate with each other over Rest/JSON quickly leads to a costly, hard-to-maintain, inconsistent mess.
We're obsessed with "one size fits all" absolutes in tech. We should have more "it depends", imho.
I've worked on both types of projects. My previous project used Scala on the backend and Node for the public API. Thrift was used to maintain persistent, reusable, connections between the public API and the backend, and to enforce strict typing on message values.
What I really liked about Thrift is that all I needed to know how to use the service was the thrift definition file. It was self-documenting.
My current project uses HTTP and JSON to communicate from the public API to the backend services. There is significantly more overhead (latency and bandwidth) and no enforced document structure (moving toward Swagger to help with that).
HTTP+JSON is great for the front-end, where you need a more universally parsable response, but when you control the communication between two systems, something like Thrift/Protobuf solves a lot of problems that are common with REST-ish services.
No No No. This is like asking which is a better tool - a nail or screw? Since everyone is so familiar with SQL, they are avoiding very good reasons to use NoSQL.
'NoSQL' is just a broad term for datastores that do not normally use standard Structured Query Language to retrieve data. Most NoSQL stores do allow for very structured data, and some have query languages that are similar to SQL.
BigTable-style wide-column stores (Hadoop's HBase, Cassandra, DynamoDB) and key-value/object stores (Amazon S3, Redis, Memcached) are absolutely critical to cloud services. JSON document DBs are needed for mobile and messaging apps. Graph databases are for relationships, and MarkLogic has an awesome NoSQL DB focused on XML.
Full disclosure: I am the founder of NoSQL.Org - but I also use multiple relational SQL databases every day.
Although companies like Snowflake are bridging the gap between the two camps, with products that can natively ingest JSON and Avro and make them queryable via SQL, allowing for joins etc.
It's possible to use SQL to retrieve data from a text file; that's not the same as using SQL backed by an actual relational database engine (and a well designed schema) that has been perfected over many decades.
1. A new business evolving quickly to discover and meet the product that fits the market it is chasing.
2. An established business optimizing for a possibly still-growing market, with a very well-established set of features and use cases. It can take longer to deliver new features and can save a lot of money by optimizing.
Highly recommended video.
Lots of what he spoke about is very relatable: having many repos, storing configs in a separate repo, the politics between people, having a tracking system.