While I know it’s not the main point of the article, I’m getting really tired of the anti-sysadmin bias in pretty much anything related to devops. It seems to be a favorite trope to paint sysadmins as a dying breed of monkeys who are only capable of keeping pets and doing manual tasks when they get around to lifting their knuckles off the floor. Who do you think is writing the playbooks for Ansible? How do you think you can even write those things without first having a deep understanding of how the system works and having a set of procedures around them? Who do you think runs the systems your containers sit on top of?
Just because infrastructure as code appears to use the same tools that developers use (text files, git) doesn’t mean developers can do the job. Me having access to a pen and paper doesn’t make me Shakespeare.
This is especially harmful in an article that claims to be aimed at management, ostensibly trying to set the future path for organizations.
If you're writing Ansible playbooks and keeping a deep understanding of the system in your head, just call yourself "devops" or "SRE" and roll with it....
More to the point, anyone who can write good, robust Ansible playbooks and who can actually deeply understand the system to the point of being able to debug a misbehaving system and reproduce problems effectively is necessarily a competent, qualified developer. You might not enjoy writing large new software systems (I don't), but you certainly could do it; you have all the skills and then some. If you're a qualified operator who is also a qualified developer, "devops" is probably a more accurate descriptor than "sysadmin" (even though, yes, in theory "devops" is not a job title but an organizational practice).
> It seems to be a favorite trope to paint sysadmins as a dying breed of monkeys who are only capable of keeping pets and doing manual tasks when they get around to lifting their knuckles off the floor. Who do you think is writing the playbooks for Ansible? How do you think you can even write those things without first having a deep understanding of how the system works and having a set of procedures around them? Who do you think runs the systems your containers sit on top of?
I totally see where you're coming from, but I suspect we've both been doing sysadmin work for a really long time. You might remember that in, say, the 90s to mid-2000s there were two kinds of 'sysadmin':
- 'sysadmin'. Subscribes to ';login:', is a member of SAGE. Views his job as automating his job until there is nothing left. Probably used Perl and shell, then Python and Ruby.
- 'sysadmin'. Person who sets up the server. Knows the basics of Windows NT or 2K. Doesn't really like to program. Automation is a nice-to-have; they don't script because they're 'not a developer'.
I choose to view 'devops' as a way for the former group to separate itself from the latter. The people who acted like 'DevOps' was a new thing probably just knew too many of the latter group and not enough Perl Mongers and alt.sysadmin.recovery types.
I see that, but I also see the opposite: sysadmins blaming "monkey devs" for things like requiring all ports open or RDP access with root.
There's so much negative blame toward the workers for management decisions and the resulting shit.
Ex: we have one dev environment with 250 bare-metal machines, any of which a dev can SSH onto. When we transitioned from ClearCase to git, git was only installed on 10 of them.
So now the sysadmins blame the devs for not requesting it on all of them, the devs blame the sysadmins for not knowing, when the reality is that the management decision not to prioritize dev/prod parity is the problem.
There are definitely Sysadmins blaming Devs, but not because they're monkeys.
The attitude that Devs have about Sysadmins comes from arrogance. Devs are arrogant that Sysadmin work is monkey work, and "they could easily do it if they had to because it's not nearly as complicated as writing code". It's a complete lack of respect for Sysadmins.
Sysadmins blame Devs because as soon as Devs are put in the position of doing ops work, they completely and utterly fail; but instead of changing their attitude about Ops work (i.e. that it's actually a real and separate skill set from dev work), they maintain their arrogance that the work is dumb and "I could do it if I really had to".
Both issues are caused by arrogance of the Devs... the way Devs see Ops work, and the way Devs interact with Ops people which is then reflected back at them by Ops.
And in your example, Ops has it right. If Devs cannot do a basic thing like provide specs on the environment/tools they need to do their own work (what, you really don't know that 'git' is a tool you use, even though you type the word 'git' 20x a day?), how is Ops supposed to do anything about that?
I wonder how much of a problem this is. Anecdotally:
I've spent my entire career - about 5 years beyond university - working in a cloud-only environment. My dev machine is an ec2 instance, whose environment my company provisions appropriately, and we have tooling for managing the application environment end to end where I rarely have to worry about linux primitives.
I can put together some pretty impressive things on AWS, reasonably comfortably, but I know that if I ever leave for a company that still runs its infrastructure classically, I'll be at a disadvantage. I've even stumbled around one-off Linux environments for personal projects; most of the concepts are still foreign to me.
Funny, I come from the opposite direction. I've been using Linux¹ and FreeBSD since the early 1990s, and was using UNIX since the mid 80s (SunOS, AIX, ...)
I find it really easy to deploy things in the cloud. I especially find it easy to fix cloud deployments when they don't work, compared to my cloud only colleagues.
If we're not careful, I think we might start to lose some basic OS level skills.
There are stories, perhaps apocryphal, about certain tribes of South American natives who find axe heads discarded by another tribe and see it as the work of the gods.
They don't know how to make their own axe heads, and see them as mystical objects.
It's not anyone's fault for not being born early enough to have ridden Linux from its infancy, but I would encourage people to learn the basics of the OS.
Putting together cloud offerings, that's only product knowledge.
We used to have four groups voting against management's worst ideas. We got rid of the DBAs, we got rid of QA. Now we're getting rid of Ops. And while that means that we have more power, it means we have a lot more responsibility, because it's just us against ignorance and bad management now.
And we all sit in a big room where we don't get time to consider our replies before giving them, which means if we make a rhetorical error it doesn't matter how valid our concerns are.
I really don't want to run my own company, but it's starting to feel more like I have to if I want anything like sanity.
Developers did not take your job; it was the machines...
Kubernetes' strength is its declarative nature: the actual "tasks" (the imperative part) move into the controller code and are automated away.
As more and more operators/CRDs (i.e. automatic sysadmins) get written, more jobs will disappear.
For example, you can imagine a database operator that, just from you stating your schema (as a CRD), will create the tables for you and the indices, do automatic backup/restore, do automatic migrations, and create the monitors and the alerts. I.e. it will completely replace the DBA.
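To make that concrete, here is a purely hypothetical sketch of what such a custom resource could look like. Everything in it is invented for illustration (the dbops.example.com group, the DatabaseSchema kind, and all the field names); it doesn't correspond to any real operator.

```yaml
# Hypothetical CRD instance: declare the schema, and let an imagined controller
# reconcile tables, indices, backups, and alerts from it.
apiVersion: dbops.example.com/v1      # made-up API group
kind: DatabaseSchema                  # made-up kind
metadata:
  name: orders
spec:
  engine: postgres
  tables:
    - name: orders
      columns:
        - { name: id, type: uuid, primaryKey: true }
        - { name: customer_id, type: uuid, index: true }
        - { name: total_cents, type: bigint }
  backups:
    schedule: "0 3 * * *"             # controller would create a backup CronJob
  alerts:
    slowQuerySeconds: 5               # and derive monitors/alerts from thresholds
```

The controller watching these objects would hold the imperative logic (CREATE TABLE, dumps, alert rules), which is exactly the "tasks move into the controller code" point above.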
There already is a database operator that can do everything you have listed and a lot more (I helped write it), but it doesn't remove the need for DBAs. It just changes what they do on a day-to-day basis. I think this is what OP was saying. Just because you have an operator that can do most of what a traditional DBA does doesn't mean you can replace all the world's DBAs. Someone still needs to know _why_ a specific query managed to do a multiplicative join and lock all your tables for hours, even if the operator knows how to flag that query and reject it.
Sure, but in a DBA's day, or even over a month, how much time does he spend dealing with deadlocks?
BTW, the operators that I've seen are not there yet. What I want to see is a CRD with a set of input schemas and an output schema, letting the operator create the most efficient query.
I suspect that, if an organization, especially one with its own hardware, tries to move to Kubernetes in a cargo cultish way and without understanding the nature of the undertaking, this could easily lead to disaster and people getting fired.
That's another way Kubernetes can take away jobs :).
It replaces the DBA for routine maintenance. It does not replace the DBA when you need an expert to help you plan your database architecture and figure out what went wrong in production.
As compute gets cheaper, the design goal of these new systems is to replace human time with computation time. This is the key idea behind automatic machine learning, for example.
Imagine a system that needs to map a logical schema or architecture into the best implementation. The system can create thousands of combinations, test them against your data/queries, and come up with the best design.
I'm an old-school sysadmin who recently has been making a buck doing "devops consultancy". In reality, my job is fixing stuff that the developers "automated" (because I don't really think an automation that is not performing any work is an automation at all). I've seen some scenarios that, with all my sysadmin expertise, are hard to describe as anything other than silly.
Not so long ago I kept hearing about some Jessie guy: brilliant, developed an amazing product, a very good engineer. His team all made clear that this guy was so efficient that he was in charge of a lot of stuff; sadly, he didn't document a vital step of the development pipeline that was causing all the builds to fail, and he left the company years ago. His team, which was at least capable enough to keep the ship afloat for more than two years without Jessie, the driving force and main engineer, combed the documentation and code but couldn't find the problem. They appear to be pretty good coders, and the codebase was clean enough that I don't question their capacity as developers, but the problem was too deep, mainly because, as they told me, this Jessie guy was pretty great.
Well, actually the problem was that Debian Jessie reached EOL some time ago, and these guys had recently migrated the build server, losing all the image cache. They were unfamiliar enough with package-manager errors, and so used to hearing about this Jessie guy, that even when they found something about the backports URL they didn't think it was related to their issue. Just changing the repo URL was enough to get them working again so I could take a look at the rest of the stuff.
That's just the silliest one I can think of, but it's one of many. I'm certainly not over-impressed with developers doing "devops" by themselves.
When you have paying customers or external SLAs, you need 2 departments:
1) DBA
2) Eng. Security
If you want to avoid hiring those people, there are things you can do:
1) design your app around a K-V store, like DynamoDB. Reporting (i.e. aggregation) will be the tough part.
2) Have a sr. engineer review security weekly and file jiras that are followed up. Don't agree to any IT compliance standards without full-time security engineers.
To directly answer the parent poster, what he's talking about are the basic mechanical things that a DBA does, which is a small portion of a DBA's responsibilities.
For example, just deciding what to monitor and how to address the alerts on large tables alone justifies having a DBA. Coinbase and Gitlab have both had extended outages recently because of inadequate DBA resources.
When I talk to programmers, I start with, "Our masters have 400 days uptime. How does your new system maintain that level of availability?" Typically that's the end of the conversation.
> Who do you think is writing the playbooks for ansible? How do you think you can even write those things without first having a deep understanding of how the system works
The sys-admins don’t even know what Ansible is. The devs are writing the playbooks and trying to convince the sys-admins that Terraform and Packer/Docker are great tools while the sys-admins are still remoting into Windows servers and clicking through graphical installer dialogues to configure the Oracle servers “just so”.
As for the “deep understanding”, that was replaced with ready-built AMIs and Packer/Docker configs ages ago. You don’t need a local expert when you have the end result readily downloadable from a central trusted source. (And embarrassingly, the available images are often better than what the local sys-admin thinks is top of the line.)
And at the end of the day, while it’s extremely cool to see someone utilize 20 years of experience to squeeze an extra 15% of performance out of your setup... well, an extra server which adds +98% performance costs like $10 a month, and the expert of 20 years is just not cost effective.
There’s a culture clash, and it’s not because expertise isn’t valued. It’s just that a lot of old sys-admins have spent their entire careers optimizing against targets that are no longer relevant, and teams can get by perfectly well without them.
Truth is that you honestly don’t need a sys-admin or DBA on a modern team. The ‘kids’ do just fine without them, and just spin up the systems on AWS/Azure/GKS with a blatant disregard for the fact that they “do not yet have the needed seniority to carry out such important tasks”. And if the horrible complexity of it all becomes too much, they spin up 3-4 variants of the infrastructure and use the one that works best in performance and reliability testing, without ever needing to seek the altar of the almighty sys-admin with gifts of gold and silver in order to bless their endeavor and be allowed the extra couple of servers.
We don’t have coffee machine admins because coffee machines are trivial in the modern software company (even though they do of course need regular maintenance and cleaning and are extremely important). The same is true of sys-admins: they are no longer needed because system procurement and provisioning are trivial (even if they also need regular maintenance and cleaning and are of course extremely important).
Your understanding of what sys-admins are/do is very limited/off-base. I'm a sys-admin for the engineering department of a public university. We use Puppet and Ansible to manage a diverse array of workstations, compute servers, and storage servers, all hosted on-premises. These machines exist to serve undergrad student lab needs, research compute clusters, and public and internal infrastructure.
There is a world outside of your bubble where servers exist to run a product.
I find it so interesting to see the sort of bubbles that people live in and believe in.
A big percentage of the S&P 500 companies, with huge in-house infrastructure, legacy systems, and sysadmins, generate most of the money in the world, and yet you think that kids who can script against a cloud API have them in their pockets.
The “kids” are typically devops engineers with 5-7 years of experience. They get called kids by sys-admins because they prefer Postgres to Oracle, and Linux to Windows. Because they don’t have the same 30 years of experience doing roughly the same thing in the same org.
Of course there are sys-admins who adapt and are now working half of a devops workflow, but they seem to end up either becoming devops engineers or finding a new job where they are back to being kings of a closed domain that developers only interact with through emails and audiences.
For the life of me I can’t figure out why I would recommend Kubernetes to any company that is already on AWS. Except for the custom stuff, you should probably use a managed equivalent, and for the custom parts where you need HA, scalability, etc., just use regular old ECS or Fargate for serverless Docker. Heck, even simpler sometimes is just to use a bunch of small VMs and bring them up or down based on a schedule, the number of messages in a queue, health checks, etc., and throw them behind an autoscaling group.
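For the schedule-driven variant, a rough CloudFormation sketch would look something like this. The resource names, subnet IDs, and the referenced launch template/target group are placeholders (defined elsewhere), and scaling on queue depth would use a target-tracking or step scaling policy instead of the scheduled actions shown here.

```yaml
# Sketch only: a small autoscaling group behind a load balancer, scaled down on a schedule.
Resources:
  WorkerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "2"
      MaxSize: "6"
      DesiredCapacity: "2"
      VPCZoneIdentifier: [subnet-aaaa1111, subnet-bbbb2222]      # placeholder subnets
      LaunchTemplate:
        LaunchTemplateId: !Ref WorkerLaunchTemplate              # defined elsewhere
        Version: !GetAtt WorkerLaunchTemplate.LatestVersionNumber
      TargetGroupARNs: [!Ref WorkerTargetGroup]                  # the load balancer's target group
      HealthCheckType: ELB          # replace instances that fail LB health checks
  ScaleDownAtNight:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref WorkerGroup
      MinSize: 0
      DesiredCapacity: 0
      Recurrence: "0 22 * * *"      # cron (UTC): drop to zero at 22:00
  ScaleUpInMorning:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref WorkerGroup
      MinSize: 2
      DesiredCapacity: 2
      Recurrence: "0 6 * * *"       # back up before the workday
```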
I’m not familiar with Azure or GCP, but they have to have more easily managed offerings than K8s.
If you’re on prem - that’s a different use case but my only experience being responsible for the complete architecture of an on prem solution was small enough that a combination of Consul/Nomad/Vault/Fabio made sense. It gave us the flexibility to mix Docker containers, raw executables, shell scripts, etc.
That being said, for both Resume Driven Development reasons and because we might be able to find people who know what they were doing, if I had to do another on prem implementation that called for it, I would probably lean toward K8s.
Yeah, the main reasons are social. Everyone wants that k8s feather in their cap just like everyone wants that AWS feather in their cap. Sure, there may be times when it's reasonable, but 95% of the hype is hot air.
There are two reasons you have to get a k8s cluster. First is that if you don't have one set up defensively, some slick sales guy will come in and scare your boss's boss's boss into thinking the company will implode without one. If you have one, you can say "don't worry, we already have that" and fend off the invasion. It's a necessary status symbol, like saying "we turned off our last physical server because we love paying Amazon 7x the cost of hardware for less control and less performance every year".
The second reason is that the pre-eminence of the fad will create a feedback loop where some tooling assumes that you have a k8s cluster because of course all of the cool kids have a k8s cluster and make it inordinately difficult to do something reasonable without one. It's thus handy for experimentation with presumptuous tools created by people who have drunk that kool-aid. We're already seeing this to a certain extent.
The harsh reality is that the vast majority of tech architectures are designed as fashion statements rather than systems to serve a functional purpose. With this reality in mind, we must look good on the runway or risk expulsion.
Would really love to see some before/after cost calculations for some cloud migrations. Techies, and especially the more senior ones at CTO level, can easily be scared into thinking that they need to use cloud, partly because they are so far removed from the tech, and partly because they just follow the trend and everyone else is doing it (similar to "no one got fired for buying IBM"). Cloud is great for many use cases, but a lot of companies just go all in even where it doesn't make sense. But perhaps everyone is getting 90% discounts and having the last laugh...
I've seen several large-scale cloud migrations and the bill has always been higher, usually egregiously so. In one case in particular, we would've been able to re-buy all the (perfectly adequate) hardware in our racks every 45 days if we had been pouring the cloud spend into hardware.
In another case, I've seen a company that spends tens of thousands of dollars a month on cloud infrastructure to run a site that services a max of 50 concurrent users. The truth of the matter is that the production site could run just fine on any developer's laptop, and a one-time spend on a pair of geographically-dispersed dedicated servers would free up huge amounts of cash without any measurable/actual impact, but the bosses won't feel very important if they acknowledge that. It boosts their self-image to have a big cloud bill and feel like a grown-up company because they're paying big invoices, and plus the CxOs can prance around and tell everyone how forward-looking they are because they're "in the cloud".
It seems like the most common pitch is "cloud is usage-based billing" and people operate under some vague theory that this will translate to savings somewhere, but despite popular belief, most workloads are reasonably static and you're just going to pay a lot more for that static workload.
The fantasies of huge on-demand load are mostly a delusion of circular self-flattery, aggressively pushed by rent-seekers and eaten up all too eagerly by people who are supposed to be reasonable stewards and sometimes even dare to call themselves "engineers".
By all means set up the cloud stuff and have the account ready to take true ad-hoc resource demands, but the number of cases where AWS and friends are an actual net savings over real hardware is infinitesimal. Most companies would be much better off if they invested in owning at least the baseline 24x7 infrastructure.
I guess the issue there is that since most companies don't really have the dynamic demand they imagine, if they actually used cloud providers for elasticity, they'd almost never use them and then they couldn't feel cool enough.
If you're a random guy, it's going to be cheaper and better to run on a Linode or small AWS instance than it will be to rent and stock a rack. If you have more than 5 employees, this is almost certainly not true.
> In another case, I've seen a company that spends tens of thousands of dollars a month on cloud infrastructure to run a site that services a max of 50 concurrent users. The truth of the matter is that the production site could run just fine on any developer's laptop, and a one-time spend on a pair of geographically-dispersed dedicated servers would free up huge amounts of cash without any measurable/actual impact,
I find it hard to believe that, even if I went out of my way to throw in every single bit of AWS technology I know, I could architect a system that only has 50 customers and make it cost that much. I could do it with AWS with a pair of EC2 servers, a hosted database, a load balancer, and an autoscaling group with a min/max of 2 for HA. That includes multi-AZ redundancy. Multi-region redundancy would double the price. That couldn’t possibly cost more than $500 a month.
Here the software side of the fad rears its head: there are about four dozen microservices involved, each with its own RDS instance, load balancer, the works. 5-6 different implementation languages were used and a large number depend on the JVM or other memory-hungry runtimes. There are a couple of so-called "data analysts" who don't really know what they're doing, never produce anything, and spend lots of money on EMR et al. Buzzwords abound.
The workload is containerized and orchestrated (of course, since a company so self-conscious about its tech fashions would never not be) but one can only increase the density so far, and obviously optimizing the infrastructure spend on "sexy cloud stuff" hasn't been the top priority.
Even hinting that hardware may be appropriate for a certain use case will bring out the bean counters in force. At a third company, I almost gave the "Global VP of Cloud Computing" an aneurysm by suggesting that there may be a use for some of the tens of millions of dollars of hardware that they'd recently purchased. In shock and disbelief, he shouted "What, now you're talking hybrid cloud?!" I said "if that's what you want to call it" as the rest of the room jumped to inform me that the R&D departments at the cloud providers ensure customers will always be using the latest datacenter technology, hastening to add that Microsoft is building a datacenter underwater somewhere, and thus it's a lost cause for anyone to run their own hardware. Some of the shadier cronies in the room chimed in to add that the hardwareless course of action had been confirmed as the ideal by both IBM and Accenture in studies commissioned by the VP.
Cloud resources are a useful tool in the toolbox, but as an industry, we have gone way overboard and lost all reason. At some point, when cloud inevitably loses its shine, the bubble must pop. If you're in the market for server hardware, this is a great time to buy.
Crazy costs for simple stuff can easily happen with on-premise systems as well - I once had an in-house infrastructure team quote £70K for infrastructure to host a single static HTML page that would be accessed by about 10 people.
There was even a kind of daft logic to their costing - didn't make it any less crazy.
If you do your cloud migration as just a lift and shift, without changing your processes or people (retrain, reduce, and automate), it will always cost more. The problem is that too many AWS consultants are just old-school net ops people who watched one ACloudGuru training video, passed a multiple-choice certification, and can click around in a GUI and replicate an on-prem architecture.
I’ve never met any that come from a development or Devops background and know the netops side.
Well, seeing that there are only 24 hours in a day and that I refuse to work more than 40-45 hours a week....
There are two parts to any implementation: the parts that only you or your company can do - i.e., turning custom business requirements into code - and the parts that anyone can do, “the undifferentiated heavy lifting” like maintaining standard servers. Why would I spend the energy doing the latter instead of focusing on the former?
If I have an idea, how fast can I stand up and configure the resources I need with a private server room as compared to running a CloudFormation Template? What about maintenance and upgrades?
How many people would our company have to hire to babysit our infrastructure? Should we also hire someone overseas to set up a colo for our developers there so they don’t have to deal with the latency?
We are talking about a situation where you already have a server room and employees.
Typically what I've seen is that the developers are being starved for resources on the on-prem hardware, and no amount of complaining or yelling or saber-rattling seems to do anything about it. But along comes cloud and we are willing to spend many times more money. The devs are happy because they can spin up hardware and apologize later, which feels really good until you find out people are spinning up more hardware instead of fixing an n^2 problem or something equally dumb in their code (like slamming a server with requests that always return 404).
> We are talking about a situation where you already have a server room and employees.
And by “changing your processes” I guess I should also include “changing your people”. Automate the processes where you can, reduce headcount, and find ways to migrate to managed services where it makes sense.
> The devs are happy because they can spin up hardware and apologize later, which feels really good until you find out people are spinning up more hardware instead of fixing an n^2 problem or something equally dumb in their code (like slamming a server with requests that always return 404).
I hate to say it, but throwing hardware at a problem long enough to get customers, prove the viability of an implementation and, in the startup world, get to the next round of funding or go public, and only then optimizing, is not always the wrong answer - see Twitter.
But if you have bad developers, they could also come up with less optimal solutions on-prem and cause you to spend more.
With proper tagging, it’s easy to know where to point the finger when the bill arrives.
My company (a large-ish pension plan, ~10 dev teams, most applications are internal) was on AWS Beanstalk and switched to Kubernetes (EKS). Once we moved in and cleared the initial hurdles, the overall impression is that K8s is very pleasant to work with: solid, stable, fast, no bad surprises. Everything just works. We probably spend 0.1 FTE taking care of Kubernetes now. Definitely was worth the cost.
All the AWS tech I've tried before (ECS, Beanstalk, plain EC2, CloudFormation) is slower, has random quirks, and needs an extra layer of duct tape.
If you need anything more than just ECS and you don't choose Kubernetes, you may end up rolling your own, in-house, less reliable subset of Kubernetes. You'll need to train everyone how to use it, and how to understand how it works when something goes wrong.
Choosing Kubernetes is like choosing Java -- it's a standard, there is a huge ecosystem of supporting software, the major cloud providers officially support it, and you can hire people who have worked with it before. Whether or not it's overkill is less important than those other factors.
As a counter-point, for anyone who has used kubernetes, trying to go to anything else feels like "kubernetes-lite" and very, very often requires duct-taping the disparate parts together because they weren't designed to be one ecosystem.
If one's use-case fits within the 20% on offer of AWS's "20/80 featureset," and one enjoys the interactions with AWS, that's great. To each their own.
But I can assure you with the highest confidence that there are a group of folks who run kubernetes not because of resume driven development but because it is a lot easier to reason about and manage workloads upon it. I know about the pain of getting clusters _started_, but once they're up, I find their world easier to keep in my head.
We seem to be forgetting KISS - at some point, yes, we need massively scalable architectures - but humans got on fine treating large numbers of animals as not-pets long before we went full-on intensive chicken farming.
There is a lot of space between pets and Google-scale
None of these can run as Windows services by themselves; I had to use NSSM. That being said:
- Consul: a three-line yaml configuration to set it up in a cluster. It’s a single standalone executable that you run in server mode or client mode.
- Once you install Consul on all of the clients and tell it about the cluster, the next step is easy.
- Run the Nomad executable as a server; if you already have Consul, there is no step 2. It automatically configures itself using Consul.
- Run Nomad in client mode on your app/web servers. If you already have the Consul client running, there is no step 2.
- Vault was a pain, and I only did it as a proof of concept. I ended up just using Consul for all of the configuration and an encryption class where I needed to store secrets.
Did I mention that we had a lot of C# framework code that we didn’t want to try to containerize, and Nomad handles everything?
That being said, I wouldn’t do it again. If we had been a pure Linux shop with the competencies to maintain Linux, I would have gone with K8s instead for an on-prem implementation.
But honestly, at the level I’m at now, no one would pay me the kind of money I ask for to do an on prem implementation from scratch. It’s not my area of expertise - AWS is.
> That being said, for both Resume Driven Development reasons and because we might be able to find people who know what they were doing, if I had to do another on prem implementation that called for it, I would probably lean toward K8s.
God, HN can be so cynical at times. (I'm not really directing this at just you, scarface74, but the overall tone of responses here). Docker and Kubernetes are not just about padding your resume. Why would I not want to use a solution for orchestration, availability, and elasticity of my services?
Why wouldn’t you? Easy: because you probably don’t have enough “services” to make the costs of kubernetes worthwhile.
If you do, then congratulations, you’re in the top 5% of dev teams, and you presumably also have a well-funded organization of people supporting your complicated deployment.
Otherwise, it’s the common case of bored devs overcomplicating systems that can be deployed more cheaply and safely with simpler technology.
I’m not saying you wouldn’t. I am saying that you get elasticity, orchestration, and availability by using AWS/Azure/GCP and managed services where appropriate, and it’s a lot simpler. I’m not saying the cloud is always the right answer, and if I were to do an on-prem/colo, I would probably go for K8s if it were appropriate.
As for Docker, it is the basis of Fargate, ECS, and CodeBuild in AWS. I’m definitely not saying that it’s not useful.
But why am I cynical? I consider myself a realist, no job is permanent, salary compression is real and the best way to get a raise is via RDD and job hopping.
> Heck even simpler, sometimes is just to use a bunch of small VMs and bring them up or down based on a schedule, number of messages in a queue, health checks, etc and throw them behind an autoscaling group.
1) It will take you longer than you think
2) It will be harder than you imagined
3) It’s harder to find people who know it and non-trivial to get good at it (this should have been closer to the top) your project can be done in six months by five k8s experts but you only found one dude who knows it and he’s more a’ight than pro.
It’s probably still worth it, just go in with your eyes open.
Be prepared for this unfortunate pattern: “this thing I want just doesn’t work and probably never will.”
Deploying k8s on a small, self-contained project that you just want to set up and go forever is probably a good place to begin.
If you try to move your whole production workflow in one go...
You’re going to have a bad time.
> 1) It will take you longer than you think 2) It will be harder than you imagined 3) It’s harder to find people who know it
Just so people have an idea of how hard it is to find people: I've got just about 1 year of experience getting Kubernetes into production at a very large (multi-billion dollar) company. I have so many job offers coming in that I'm not even talking to companies offering less than 350k total comp. I don't have a college degree, and 5 years ago I was making $50k a year. That might just be the general bubble-ish nature of the tech industry right now, but if they're throwing around that kind of money for someone like me, I imagine that small shops have no way to compete.
Any chance you'd be happy to elaborate on what your role is? Are you primarily part of a dev-ops team, or tackling Kubernetes as part of developing the product? Did you obtain CNCF certification / think there's much value in those?
Also, if _all_ you want to deploy is a small, self-contained project, you don't need k8s. Most small companies/startups just don't need it. Stick with what is simple.
Yuck. We blew a ton of money and time, despite a great infra/hpc/etc team (PhD, ex-Google, ex-Netflix, ex-scale video, etc), on a skilled but overambitious infra subteam, where we should have used (and ended up on) docker / compose. K8s makes sense if you have a ton of servers or scale or devs, and want to pay employees and a team in perpetuity to focus on it; but for half the folks out there, stick with the individual pieces, or you take on the cost of OpenStack and are no better off for the increment over the few pieces you could have carved out without k8s itself.
The article that leadership needs, vs vendor / fanboy bullshit, is "only do k8s if you have these exact scale and ops problems, your stack looks exactly like this, your entire app team looks like this, you have full-time k8s people now and forever, and you are willing to tax each team member to now deal with k8s glue/ops/perf problems, and xyz other things; otherwise, if any of these simplifying or legacy assumptions apply, perfect: you get to focus on these way faster and cheaper agile orchestration options and actions".
> where we should have (and ended on) docker / compose
We ended up doing this too.
Do you have any links on deploying docker-compose in production? We've not been able to find much. However, our solution seems to work well - I'm keen to find out how other people are managing host setup, updates, and remote control of docker-compose.
In our setup, we essentially use docker-compose as a process manager (a minimal sketch follows the list):
- Updates with docker-compose.yml via git syncs
- Logging via journald (which in turn is forwarded)
- Simple networking with `network: host` and managing firewalls at the host level with dashboard/labelling
- Restarts / Launch on startup with `restart: always` policy
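Roughly, a docker-compose.yml along those lines looks like the sketch below; the image name and tag are placeholders, and compose spells `network: host` as `network_mode: host`.

```yaml
version: "3.7"
services:
  api:
    image: registry.example.com/api:1.4.2   # placeholder image, tag pinned per release
    network_mode: host                       # host networking; firewall handled at the host level
    restart: always                          # relaunch on failure and on host boot
    logging:
      driver: journald                       # container logs go to journald, then get forwarded
      options:
        tag: api
```

An update is then just the git sync of this file followed by `docker-compose up -d`.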
IMO, this is more straightforward than what we did in the past:
1) Entrusting everything to a package manager or build script which might break on a new version release
2) Maintaining an Ansible script repository to do the host configuration, package management, and updates for you - these too always need manual intervention on major version updates
Fairly similar. Docker/docker-compose takes care of launch, healthchecks / soft restarts, replication, GPU virtualization & isolation, log forwarding, status checks, and a bunch of other things. Most of our users end up on-prem, so the result is that _customers_ can admin relatively easily, not just us, despite weird stuff like use of GPU virtualization. I've had to debug folks in airgapped rooms over the phone: ops simplicity is awesome.
Some key things beyond your list from an ops view:
-- containers/yml parameterized by version tag (and most things we gen): simplifies a lot
-- packer + ~50 line shell script for airgapped tarball/AMI/VM/etc. generation + tagged git/binary store copies = on-prem + multiple private cloud releases are now at a biweekly cadence, and for our cloud settings, turning on an AMI will autolaunch it
-- low down-time system upgrades are now basically launching new instances (auto-healthcheck) + running a small data propagation, and upon success, it dns flips.
-- That same script will next turn into our auto-updater on-prem / private cloud users without much difference. They generally are single-node, which `docker-compose -p` solves.
-- secrets are a bit wonkier, but essentially docker-compose passes .envs, and dev uses keybase (= encrypted gitfs) and prod is something else
Some cool things around GPUs happening that I can't talk about for a bit unfortunately, and supporting the dev side is a longer story.
Some of these patterns and the tools involved are a normal part of the k8s life... which is my point: going incrementally from docker / docker-compose or equivalent lightweight tooling will save your team + business time / money / heartache. Sometimes it's worth blowing months/years/millions and taxing the folks who'd otherwise be uninvolved, but for easily over half the folks out there, probably 90%+, it's not worth it. Instead, as we need a thing, you can see how we incrementally add it from an as-simple-as-possible baseline.
For our organisation (mid size, varied skill level across the org) Kubernetes solves the problem of pattern drift. There is one way to do things, and when there are multiple ways, we enforce a single one.
An example is databases. We offer Postgres (via an operator that provisions an RDS instance). AWS lets you choose whatever you want, and frankly, at our scale and level of complexity, there is simply no reason to choose one SQL db over another other than bikeshedding. So being able to enforce Postgres, and as a result provide a better, more managed provisioning experience (with security by default and fewer footguns) and training and support for just one DB type, is incredibly powerful.
As Kelsey Hightower has often said, Kubernetes is a platform for building platforms. Just deploying a Kubernetes cluster, giving devs access and calling it a day is rarely the right answer. Think of it as a way to effectively provide a Heroku alternative inside your company, without starting from scratch or paying Heroku prices
Aren't you using a technology to solve a social problem?
I don't have the experience with Kubernetes to have an informed opinion, but I've solved plenty of social problems in software organizations. Those are usually approachable without making a long-term commitment to a technology that may or may not be the right choice.
Good points in here. The one thing about Kubernetes that I think teams need to be wary of is that upgrading needs to be treated as a first class citizen. In order to use the awesome tooling that so many people are building in Kubernetes (see basically all of the CNCF projects), the cluster API needs to be kept up to date. Once you start falling behind, you run the risks of being stuck in a version compatibility nightmare.
Other than that, I can't imagine running non-stateful applications on anything other than Kubernetes anymore.
The one point I keep driving home is that Kubernetes does so much useful stuff that is a pain to manage otherwise that it's totally worth the risk currently posed by its rough edges. I hope that came out clearly in the article.
Maybe this is the analogy: k8s is now the infra equiv of npm / js frontends, where you need to run to stay in place, and will increase day-to-day strain on everyone. If you're ok with that price b/c your other things are so on fire in a few very specific dev+services at-scale ways, start. Otherwise, give it another X years before you slow down your team(s) & business.
> How do you deal with stateful applications like databases?
The impedance issues between kubernetes and stateful applications have been mostly solved. The main problems stemmed from the object model not yet being mature enough. Deployments did not offer good solutions for assigning stable identities to pods across restarts, or making it possible for them to find and attach to persistent volumes. StatefulSets and PersistentVolumeClaims solve these problems. Most of our k8s workloads are stateless, but we do run some big elasticsearch workloads in it. Having said that, I think if you're a small- to mid-sized business running on the cloud managed offerings are the first place to look. GCP CloudSQL has worked extremely well for us.
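For anyone who hasn't seen them in use, a rough sketch of that stable-identity-plus-storage pattern is below; the image, storage class, and sizes are placeholders.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-data
spec:
  serviceName: es-data            # headless Service gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels: { app: es-data }
  template:
    metadata:
      labels: { app: es-data }
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0   # placeholder
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:           # one PVC per pod, reattached across restarts/rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard-ssd   # placeholder storage class
        resources:
          requests:
            storage: 100Gi
```

Pods come back as es-data-0, es-data-1, etc., each finding its own claim again, which is what Deployments couldn't give you before.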
To someone who is not familiar with the technology beyond a high-level overview like in OP, the sentence "The main problems stemmed from the object model not yet being mature enough" sounds like "The main problems with stateful applications stemmed from there being problems with stateful applications".
The problem was that these features came a little later after many of the stateless features were in there. Now we're talking about the stateful stuff being stable for many years, so all the arguments against it are usually old or uninformed.
Fair enough. I could have said it more clearly. The strategies, api entities, and implementations to support stable pod identity and declarative persistent storage weren't there a couple of years ago, but now they are.
What about an on-prem Kubernetes cluster - would you use Kubernetes for something like a MySQL database, or run it outside of the container environment and just link to it from the stateless services in the cluster?
Not OP, but my team is running on 100% k8s, mostly on bare metal. We do everything on k8s because of principle of least surprise: When apps are running as Deployments or StatefulSets, everyone knows how to deal with them (how to find logs, how to restart etc.) and it fits in the established deployment procedures (everything is a Helm chart etc.). When you have a single stateful service with a completely different configuration (e.g. RDBMS on a dedicated machine), it's always going to be "the odd-ball that only Greg knows how to upgrade". The whole point of k8s is not having these odd-balls. We have all databases in k8s, with physical volumes backed by an NFS-connected storage that's sitting in the same rack.
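For reference, that NFS-backed setup boils down to a manually provisioned PersistentVolume plus a claim - roughly like the sketch below, where the server address, export path, and sizes are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pg-data
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadWriteOnce"]
  nfs:
    server: 10.0.0.50          # the storage box in the same rack
    path: /exports/pg-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ""         # bind to the pre-provisioned PV rather than a dynamic class
  resources:
    requests:
      storage: 200Gi
```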
> We have all databases in k8s, with physical volumes backed by an NFS-connected storage that's sitting in the same rack.
I'm surprised you haven't seen any major issues with doing this. NFS exercises some dark corners of Linux's VFS, and most people don't run databases on top of it (meaning you could run into a fair few pretty fun issues once you're doing enough transactions). From memory, most database vendors are pretty twitchy when you ask "do you support running $DB on NFS?".
> We have all databases in k8s, with physical volumes backed by an NFS-connected storage that's sitting in the same rack.
I was searching about this yesterday - how is your performance running a database on NFS? If your entire db can fit into the RAM cache it's probably fine, but heavy IO is what I worry about with NFS.
I think it depends on your NFS. A storage appliance with NFS hardware offload will work much better than a software NFS.
Personally with bare metal database I'd try to do fibre channel if supported, then iscsi. Would only dabble with NFS if it had hardware offload, but even still I think block storage would have better iops.
AFAIK the reason for using NFS is that it's supported in k8s out of the box. I don't recall any performance issues for the things that we're doing. When we had database issues, it was the usual stuff: missing indexes, suboptimal config settings etc.
Some storage appliances have special hardware to accelerate NFS operations. Those may be able to come close to the same performance as a block storage device.
The other option with NFS is running it in software such as a NFS server on Linux configured with /etc/exports and backing the data to a block storage device. This is CPU intensive and would be much slower under load.
That's an interesting question. Why would you anticipate a possible difference? Is the cluster running a different scheduler? If anything, I would expect cloud-hosted managed clusters to have fewer free resources due to running pieces of the cloud provider's control plane.
There are bottlenecks in the Kubernetes architecture that start popping up once you cross about 250-300 Pods per node (the published limit in the K8S docs is actually 100).
This is typically more of an issue when running the cluster directly on large bare-metal nodes because in the cloud or on virtualization the host gets split into more smaller nodes from a K8S perspective.
Obviously this also depends on the characteristics of the workload, if you have a lot of monoliths that you have only barely containerized the Pod per Node limit may not be a concern at all even on bare-metal.
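For context, the per-node ceiling is the kubelet's `maxPods` setting (110 by default in recent releases), so raising it on big bare-metal nodes is a kubelet configuration change along these lines - the value shown is just illustrative:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250    # default is 110; pushing past ~250-300 per node is where the bottlenecks show up
```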
I don't recall any problems with pod placement, mostly because the machines are big enough to have ample room left over. We did move our ELK stack to a separate VM-backed k8s to be able to scale it better, but that was mostly because the storage box of the bare-metal cluster didn't have enough space to store more than a week's worth of logs.
Hey, do you have any pointers on how to set up databases in Kubernetes with NFS? Do you use StatefulSets + the Kubernetes NFS provisioner, or do you interface with the NFS store using a different storage provisioner?
I'm not into containers (in much the same way the article embraces 'herds not pets' we serverless folks embrace 'cockroaches not herds') but you can connect to DBs from a lambda in <20 ms.
The DB itself is typically provided by the cloud provider and is usually a traditional cluster setup. Is it stateful? Yes, but you could throw it into a river, not know anything about how it was configured, connect another box to the cluster and get the capacity back. Exactly like whatever was running the lambda.
You need to break up your app into stateful and stateless parts, run stateless ones using Kubernetes, and continue running the stateful ones the way you do now.
I work for Buffer and have been working on our production Kubernetes cluster.
When we started with Kubernetes we were faced with multiple challenges. Fortunately, with time we managed to tackle most of them. I will list them here and share our learnings.
We just moved 34 microservices to EKS, along with associated migration of datalayer and built out coupling to the few things remaining on-prem. It took about 24 months and cost about $20M, about twice the optimistic estimates. We are beginning to see the force multiplier. As a person who does things, it was a slog, but it seems to have been worth it.
I really dislike the "herds not pets" analogy. It implies lack of attachment due to necessity of slaughter - it also implies lack of attachment to your craftsmanship (as in you care less about each host). I find it jarring every time I read it and think the analogy is fraught and totally unnecessary.
It is a really awkward way to indicate that you need to think of things on a service and resource basis as opposed to a hand-crafted machine basis. Do we need to bring animals into it at all?
It’s a simple term which uses a well-known social phenomenon as an analogy to describe an architecture. You could have called it the Maxwell-Hertz Philosophy of infrastructure design, but then you would need to spend a ton of time socializing it. But... why? Pets vs cattle is easy to understand, you get the concept right away with little explanation, and it’s kinda tongue in cheek. Another term I’ve heard used for it is to treat infrastructure as phoenixes: you can burn it down but get a functioning one right away. But cattle vs pets lets you refer to both kinds of machines.
I’m sorry if the references to animal cruelty (kinda?) make you uneasy, but let’s be honest, this is CS. Master-slave replication, parents killing child processes, canary deployments, etc. all seem distasteful if jumbled together. Their use is intended as an attempt at humor, but also as a catchy and vivid name to easily recall the underlying principles, which is kinda brilliant.
Is Kubernetes all about scaling a service on a cluster of servers? Or is it also used for managing the same applications on a bunch of servers?
e.g. 100 customers, 100 bare-metal servers, the same dockerized applications on each one. I need to deploy a new version of one of these applications on every server.
Or would something like Ansible be better suited and easier for this use case? I've never used Ansible, but I understand that I could make a recipe that does a git pull and a docker-compose down && docker-compose up -d on every server in a single command, right?
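Something like this is what I have in mind - a hedged sketch where the repo URL, paths, and host group are placeholders:

```yaml
# deploy.yml - run with: ansible-playbook -i inventory deploy.yml
- hosts: app_servers          # placeholder inventory group
  become: true
  tasks:
    - name: Sync the application repository
      ansible.builtin.git:
        repo: https://git.example.com/acme/app.git   # placeholder repo
        dest: /opt/app
        version: main
    - name: Recreate the compose stack
      ansible.builtin.shell: docker-compose down && docker-compose up -d
      args:
        chdir: /opt/app
```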
You could use Kubernetes for this - each customer having their own pod/set, all of which are based on the same template. Kubernetes would help make the overall management a lot easier by abstracting the underlying infrastructure (100 individual services is enough for the cost/benefit ratio of the abstraction to kick in).
That being said, a configuration management tool like Ansible would also fit your use case nicely. Ansible in particular is great for the 'I need 100 identical application stacks' scenario.
If your application stack is already containerized, you have the luxury of following either path. It really depends on how much overhead you experience from managing the underlying infrastructure. You can imagine the option range as something like "bare individual VMs w/ Docker >>> Managed container service >>> Managed Kubernetes service".
You can use Kubernetes to run a service on every machine. But that is usually done for support services like logging that need to run on every node.
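That run-on-every-node pattern is what a DaemonSet is for; a rough sketch for a log shipper (the image and mounts are placeholders) looks like:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
spec:
  selector:
    matchLabels: { app: log-agent }
  template:
    metadata:
      labels: { app: log-agent }
    spec:
      containers:
        - name: agent
          image: fluent/fluent-bit:2.2    # placeholder log shipper image
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log                # read the host's logs on each node
```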
Is it one customer per server? That wouldn't be a very good fit for Kubernetes. Kubernetes is about running containers on a cluster of machines, scaling multiple services.
Kubernetes would let you run a service per customer on a cluster of machines and scale the number of containers for each customer independently. You wouldn't have to set up a new server for each customer.
I disagree about it not being a good fit. Even if you run one application per host (and, if we're honest, it will always end up being more once we find flaws), you still get the benefits of very predictable deployments, failovers, lifecycle management, etc.
I’m genuinely interested to know when containers should be used. I’m not ignorant and I understand the advantages - consistent env, run anywhere, fast startup time, etc. But there is also considerable cost involved: increased complexity, educating devs, hiring experienced DevOps, etc.
I have 15+ years of professional development experience and AWS experience, and I understand the scale.
Our company has 20-30 instances used by 5-6 services. We use Terraform/Ansible to deploy infrastructure, so deploys are pretty reliable and repeatable. So I am genuinely interested to understand whether it’s worth going the container route.
I don’t think you should move unless you expect your company will expand a lot. It seems like your scale is fairly modest, your setup works and users are happy. So it’s probably fine.
Advanced orchestration systems like Kubernetes are very valuable if you want to do multiple deployments in the same day, across many different services, without losing your mind. It lets you move very, very quickly... building a docker container, pushing it up to a registry, and switching your deployment to start using the updated version is ridiculously simple.
So it depends on what kind of shop you are and what your devs want. If they like to iterate rapidly, containers and kubernetes is a well known path to get there.
I find Kubernetes a funny tech experiment. It does not present any new innovation. We first had Docker. It's too hard to use. Let's abstract Docker with an orchestration layer. We have Kubernetes. It's now too abstract and hard to use. Let's create kops to tame it.
I guess this whole experiment is backed by a deep-pocketed tech giant trying to chip away market share from the other two giants.
If Kubernetes can make us less dependent on AWS, that's a good thing. Until then, I just sit and learn about this silly tech battle. I'm not keen to move production stuff into Kubernetes.
Things that are very general and powerful will always require some level of skill to use. There are easy ways to use kubernetes (look at kubeadm), but you still absolutely need to understand the fundamentals in case you hit a problem.
Whether Kubernetes will really drive up your server utilization levels depends on how effective your operations staff were at bin-packing and whether your Kubernetes cluster uses CPU quotas. There are inefficiencies in how the Linux scheduler deals with CPU cgroups that constrain CPU utilization.
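Concretely, the quota comes from container CPU limits: a spec like the sketch below (image and numbers are illustrative) is translated into a cgroup CFS quota, and the resulting throttling is where the utilization ceiling tends to come from.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"    # what the scheduler bin-packs against
        limits:
          cpu: "1"       # enforced as a CFS quota (cpu.cfs_quota_us); throttling can leave cores idle
```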
It seems like there are a lot of people completely unwilling to accept that K8s can be a solution, because their setup didn’t need it, or they had a bad experience.
Maybe not everyone has needs as circumscribed as yours. Maybe some people really do need to operate at that scale. Maybe they’re not just full of shit.
> Just because infrastructure as code appears to use the same tools that developers use (text files, git) doesn’t mean developers can do the job. Me having access to a pen and paper doesn’t make me Shakespeare.
> This is especially harmful in an article that claims to be aimed at management, ostensibly trying to set the future path for organizations.
P.S. Otherwise a nice overview of things