
Container technologies at Coinbase: Why Kubernetes is not part of our stack - arkadiyt
https://blog.coinbase.com/container-technologies-at-coinbase-d4ae118dcb6c
======
dvt
> We would need to build/staff a full-time Compute team

This actually was a very real problem at my current job. The data pipeline was
migrated to k8s and I was one of the engineers that worked to do that.
Unfortunately, neither I nor the other data engineer was a Kubernetes person,
so we kept running into devops walls while also trying to build features and
maintain the current codebase.

It was a nightmare. If you want k8s, you really do need people that _know_ how
to maintain it on a more or less full time schedule. Kubernetes is really not
the magic bullet it's billed to be.

> Managed Kubernetes (EKS on AWS, GKE on Google) is very much in its infancy
> and doesn’t solve most of the challenges with owning/operating Kubernetes
> (if anything it makes them more difficult at this time)

Oh man this hits home. EKS is an absolute shitshow and their upgrade schedule
is (a) not reliable, and (b) incredibly opaque. Every time we did a k8s
version bump, we'd stay up the entire night to make sure nothing broke. We've
since migrated to cloud functions (on GCP; but AWS lambdas could also work)
and it's just been a breeze.

I also want to add that "auto-scaling" is one of the main reasons people are
attracted to Kubernetes... but in a real-life scenario, running something like
2000 pods with an internal DNS, a few Redis clusters, Elasticsearch, and yadda
yadda, it's a complete pain in the butt to _actually_ set up auto-scaling.
Oh, and the implementation of Kubernetes cron jobs is complete garbage
(spawning a new pod for every job is insanely wasteful).
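
To illustrate that last point (a minimal sketch; the name, schedule, and image
are made up): every scheduled run of a CronJob like this creates a brand-new
Job and pod instead of reusing a long-lived worker.

```yaml
# Hypothetical CronJob: each scheduled run spawns a fresh Job, which spawns a fresh pod.
apiVersion: batch/v1beta1          # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: nightly-report             # made-up name
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: example.com/report-runner:latest   # placeholder image
              args: ["--run-once"]
```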

~~~
dvcrn
I work on a 2-person project and decided to go with kubernetes (through
digitalocean) for the cluster. I am managing everything with terraform and I
don't have any big problems. I like that I can write everything as terraform
manifests, have it diffed on git push and applied to prod if I want to.

Sure it had a learning curve but now I just describe my deployments and k8s
does the rest, which then reflects back on digitalocean. If I need more power
for my cluster, I increase the nodes through digitalocean and k8s
automatically moves my containers around how it deems fit.
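
To give a rough idea of what "describing a deployment" means (a minimal
sketch; names and image are placeholders - I actually express this through
terraform, but the k8s-native YAML is equivalent):

```yaml
# Hypothetical Deployment: declare the desired state, k8s reconciles it across nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # made-up name
spec:
  replicas: 3                      # k8s keeps 3 pods running wherever nodes have room
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
```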

I used normal blue/green deployments on self-managed VMs in the past, then
worked with beanstalk, heroku, appengine and I much prefer k8s. Yes it's
easier on heroku, but try to run 2-3 different containers on the same dyno for
dev to keep cost down. On k8s I can run my entire stack on one single small
digitalocean $10 VM if I wanted to.

I wouldn't even know what else I could pick that gives me equal flexibility
and power.

~~~
rawoke083600
if you can run everything on a $10 vm... do you really need k8s?

~~~
merb
most people probably want the following:

\- no downtime deployments

\- distributed jobs

\- as managed infra as possible

without k8s some things would be hard.

~~~
Juliate
no downtime deployments? happened before k8s; used to do that several times
with some haproxy.

distributed jobs? same; nothing prevents you from spawning runners with ad-hoc
libraries and queues.

managed infra? not specific to k8s

~~~
manigandham
The "adhoc" part is the problem. K8S is standardized and offers high-
availability, failover, logging, monitoring, load balancing, networking,
service discovery, deployments, stateful services, storage volumes, batch
jobs, etc. And it can run self-contained on a single machine or scale out to
1000 nodes.

Why piece all of that functionality together yourself into some fragile
framework instead of using an industry standard?

~~~
spottybanana
"Why piece all of that functionality together yourself into some fragile
framework instead of using a industry standard?"

Quite a recently developed "industry standard". Many of the tools mentioned
have been used for decades; they work robustly, are well documented, and there
are lots of people who can use them. I personally would use the term "industry
standard" a little bit differently.

~~~
cameronbrown
> Quite a recently developed "industry standard". Many of the tools mentioned
> have been used for decades; they work robustly, are well documented, and
> there are lots of people who can use them.

k8s is based on Borg, which has existed for far longer.

[https://research.google/pubs/pub43438/](https://research.google/pubs/pub43438/)

~~~
vertex-four
"What Google does" is not an industry standard.

~~~
cameronbrown
That's not what I'm saying. Your point was about the maturity of k8s, which
would make sense if it had sprung from nowhere, but k8s encapsulates a lot of
the "lessons learned" from a very stable and mature product (even if
proprietary).

------
theptip
Anecdata: Series B startup. I've found GKE to be almost completely painless,
and I've been using it in production for more than 4 years now. I don't think
the article gave a fair representation on this count; sharing a link to a
single GKE incident that (according to the writeup) spanned multiple days and
only affected a small segment of users doesn't (for me) substantiate the claim
that "it isn’t uncommon for them to have multi-hour outages".

In my experience, multi-hour control-plane outages are very rare, and I've
only had a single data-plane outage (in this case IIRC it was a GKE networking
issue). Worst case I see is once or twice a year I'll get a node-level issue,
of the level of severity where a single node will become unhealthy and the
healthchecks don't catch it; most common side-effect is this can block pods
from starting. These issues never translate into actual downtime, because your
workloads are (if you're doing it right) spread across multiple nodes.

I wouldn't be surprised if EKS is bad, they are three years behind GKE (GA
released in 2018 vs 2015). EKS is a young service in the grand scheme of
things, GKE is approaching "boring technology" at 5 yrs old.

~~~
whatsmyusername
EKS is a feature parity product. The pricing makes that painfully obvious.

~~~
AaronM
We've been using EKS since it went GA. We haven't had a single control plane
outage that I am aware of.

~~~
whatsmyusername
I'm sure it works fine if you can get it running (documentation didn't work
when I played with it). I'm referring more to the $200/month it was per
control plane.

That to me is a product they offer because someone else is offering it, but
they don't want you to actually use it.

~~~
amerine
I take it the isolated $200/control plane is too expensive. What do you think
the right price point is for that isolation?

~~~
whatsmyusername
ECS control planes cost nothing.

If you're actually looking to build isolation in AWS then you're going to need
EC2 dedicated for your EKS member hosts. So you're not getting isolation for
$200/month (I'd have to spec it out but dedicated hosts are pricey to the
point that it'd be competing with physical hardware in a colo).

That $200/month for not-really-isolated is also per plane. So if you want a
separate staging environment it's another $200/month. Client API sandbox? Same
thing. It's wack.

------
DevKoala
I work for a mid size company with 30-40 engineers managing 20-30 very diverse
apps in terms of scale requirements and architectural complexity. It took our
devops team (4-5 people) probably 18 months to learn and fully migrate all our
apps to Kubernetes. The upfront cost was massive, but nowadays the app teams
own their deployments, configurations, SLAs, monitors, and cost metrics.

Introducing Kubernetes into our org allowed us to do this; we would have never
gotten here with our legacy deployment and orchestration Frankenstein. The
change has been so positive that I adopted Kubernetes for my solo projects
and I am having a blast.

I understand Coinbase's position, and they need to stick to what works for
them. I just wanted to bring up a positive POV for a technology I am becoming
a fan of.

~~~
ec109685
One other advantage of Kubernetes that is overlooked in the article is the
benefit of instant auto-scaling on a heterogeneous cluster. For example, if
you have 10 apps on a k8s cluster that each use the same resources, you can
give the cluster a 20% buffer, which would let any single app instantly use
300% of its allocation. With VMs, you'll be stuck waiting for VMs to spin up
or have to give each app its own large buffer to handle bursts of traffic.
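
As a sketch of how that buffer is typically expressed (the numbers are
illustrative): each app requests its steady-state share but gets a much
higher limit, so a single pod can burst into the cluster's spare capacity
without waiting for new VMs.

```yaml
# Illustrative pod: request the steady-state share, allow bursting to ~3x of it.
apiVersion: v1
kind: Pod
metadata:
  name: app-1                               # hypothetical app
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: "500m"              # what the scheduler reserves per replica
          memory: "512Mi"
        limits:
          cpu: "1500m"             # burst ceiling, ~300% of the request
          memory: "1Gi"
```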

~~~
amerine
It's amazing to me how many people associate this with kube. You've had this
for years in numerous ways; the only difference is some YAML and marketing.

~~~
jayd16
What other solution is cloud agnostic and also has an easy local dev story?

~~~
pg-gadfly
_any_ non-proprietary tool is "cloud agnostic". Kubernetes is bundled
software, and achieves the things those pieces achieve. There is nothing holy
about k8s specifically; the tools you train with are the ones that feel easy,
and it's very easy to get skewed opinions because of that.

For example, a lot of people would find writing scripts cumbersome, but not a
person who's written a lot of them. They're not any more fragile than any
other software capable of logic errors.

~~~
kryptk
You didn't actually answer the question: what free, cloud-agnostic tools let
me specify "keep 60% cpu load average" and work the rest out?

With k8s, a horizontal autoscaler is a few lines in a yaml file and the result
works in any cluster run by any vendor.
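
For reference, a minimal sketch of those "few lines in a yaml file" (the
target Deployment name is made up; the API version varies by cluster version):

```yaml
# Hypothetical HorizontalPodAutoscaler: keep average CPU around 60% across replicas.
apiVersion: autoscaling/v2beta2     # autoscaling/v2 on newer clusters
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                       # made-up Deployment name
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```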

~~~
pg-gadfly
Any load balancer will let you do that. Scaling is a few lines of scripting on
any platform, and well worth the couple of minutes; you don't even need a tool
for just that.

No tool "works the rest out". It's _always_ a compromise, because inherent
complexity can never be removed, only moved. What you gain in one area,
encumbers another.

You may freely use k8s, but it's not magical, nor easier to use than existing
systems. In fact, adopting it often takes non-trivial time, and the web is
full of failure stories with very benign warnings and catastrophic results.

------
Shicholas
As someone who's gone the opposite way (moving from ECS to Kubernetes), I
think the author is understating how good managed Kubernetes solutions are.

At my current job, I use Azure's managed Kubernetes service, which does a
great job of providing a consistent environment that's very easily managed,
with no unexpected updates, great dataviz, and, if you choose, simple
integrations to their data storage solutions (run stateless K8 clusters if you
can) and key vault. We don't do much outside of our kubectl YAML files, which,
as commented below, are understood de facto by a large number of people.

CVEs will always exist, which is why network security is important. I think we
can agree that the only ingress into your cloud environments should be through
API servers your team builds, and everything else should be locked down to be
as strict as possible (e.g. VPNs and SSO). With a system like K8, so many eyes
on the code mean so many more CVEs will exist, so I don't find this argument
compelling.

My team, and so many other teams worldwide, are betting that the K8 community
will accelerate much faster than roll-your-own solutions, and K8 gives us the
best opportunity to create cloud-agnostic architecture. Additionally, helm
charts are easy to install, and afaict more software vendors are providing
"official" versions - which means for a team like mine, which is happy to pay
for services to manage state, in the same vein a company chooses AWS RDS over
managing their own Postgres server, we can get the same benefits as the author
with a cloud-agnostic solution.

~~~
speedgoose
You don't see random network errors, often visible with DNS, on your Azure
managed kubernetes clusters?

~~~
Shicholas
I haven't yet; our ingresses have passed routine ping checks (we use New Relic
Synthetics for this) for a while now. Fingers crossed.

~~~
folmar
Ingress on AKS is easy; egress will be the pain if you need anything from it.

------
nemothekid
One thing that is regrettable about K8s winning the orchestration wars so
remarkably is that it pretty much killed all other solutions. Swarm is dead,
Nomad doesn't seem like it has much community support, and Mesos _feels_ like
it's on life support. Mesos still has a lot of people working on it, but the
perception feels different.

Personally I've found Mesos much easier to manage, secure, and operate than
k8s. However, when it first came out all the cool kids were using it, then
most of them jumped ship to k8s. AirBnB's Chronos is now pretty much a dead
project, and Mesosphere's Marathon is now gimped (no UI), with major features
moved into DC/OS. At the same time, Mesosphere (now D2iQ?) seems like it's
more focused on k8s.

k8s is everything plus the kitchen sink, and managed k8s isn't the killer
feature I thought it would be. I don't blame people for not jumping on the k8s
bandwagon at all.

~~~
mpfundstein
Nomad, Consul + Swarm is still a lovely solution and I prefer it a lot to K8s.
K8s is a big monolith and often way too complex for my personal use cases. I
hope Hashicorp sooner or later builds a proper replacement for Swarm so that
we can have overlay networks without hassle. I know there is Weave, but never
tried it.

~~~
tie_
K8s is a lot of things, but a monolith it is not. Quite the opposite - the
complexity comes from the _large number_ of relatively simple components
interacting in various ways.

------
wildermuthn
This article was written to address what seems like an internal debate or
discussion about why they don’t use Kubernetes at Coinbase. As such, it boils
down to: we already use something else, and moving to K8s comes with risks and
challenges that outweigh the benefits, particularly security concerns. As a
crypto firm, Coinbase rightfully seems zealous about security.

The article doesn’t claim that K8s is a bad solution. It only claims that
migrating to K8s doesn’t work for them at this point in time.

If you are starting something new, or rebuilding a system entirely, and if you
are on GCP, and you anticipate the need for scale, then K8s/GKE is a sane
choice. What would be insane is trying to roll your own solution. There’s no
avoiding the complexity of managing infrastructure, but at least with K8s you
won’t be alone, and you won’t make the mistakes others have already made and
fixed. Some people never seem to get K8s, in the same way some people never
get functional programming. You might be one of them, so that’s something to
take into account.

A lot of people have had bad experiences with K8s, but just as many have had
bad experiences with Docker, or with Linux, or with computers in general. This
doesn’t mean the technology is bad. It just means that not every good
technology will work for you or your use-case. Kubernetes is a solid choice,
especially GKE. But think carefully before deciding to use it. It is one of
those tech decisions that will dominate the rest of your choices for years. In
my case, it was a decision I've never regretted.

------
kureikain
Look at what they have built:

\- container orchestration platform is Odin + AWS ASGs (auto-scaling groups).

\- Codeflow (our internal UI for deployments)

\- Odin kicks off a step function and begins to deploy your application. New
VMs are stood up in AWS and loaded into a new ASG, your software is fetched
from various internal locations, a load balancer starts health-checking these
new instances, and eventually traffic is cut over in a Blue/Green manner to
the new hosts in the new ASG behind the load balancer.

\- To handle secrets and configuration management we have built a dynamic
configuration service that provides libraries to all internal customers with a
p95 of 6m

\- re-scheduling/moving of your containers if your VM dies/becomes unhealthy
in your ASG

So if your company has all of that, I agree there is no reason to use
Kubernetes. But what if your company doesn't have any of the above systems?
Kubernetes.

I have used bare EC2 VMs, deploying with Ansible, Capistrano, Chef, Docker...
all of them. I even rolled my own autoscaling with Consul and SQS (for
termination notices), and I had more downtime than when using Kubernetes.

With Kubernetes, the learning curve is very high, but once you get it, it's
painless to bring in a new service. With AWS ACM to terminate TLS and an
ingress controller, pretty much as long as you have a Dockerfile, you can run
it.
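
As a rough sketch of that last point (a generic example, not my exact setup;
the host, service name, and certificate ARN are placeholders), with an ALB
ingress controller terminating TLS via an ACM certificate:

```yaml
# Hypothetical Ingress for the ALB ingress controller; TLS is terminated at the
# load balancer using an ACM certificate (the ARN below is a placeholder).
apiVersion: networking.k8s.io/v1beta1   # networking.k8s.io/v1 on newer clusters
kind: Ingress
metadata:
  name: my-service
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/abc123
spec:
  rules:
    - host: my-service.example.com      # placeholder host
      http:
        paths:
          - path: /*
            backend:
              serviceName: my-service   # Service fronting the Dockerfile-built image
              servicePort: 80
```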

~~~
nihil75
Odin is just a Step Functions flow that deploys Auto Scaling groups. It's not
that complicated, and not so different from your setup.. the main difference
is they deploy Docker images to EC2 rather than running Ansible to configure a
running machine.

~~~
kryptk
What about resource allocation? With Kubernetes I can specify both requests
and limits for CPU and memory and it will fill my nodes to match. A node
autoscaler gets me more nodes if needed. I can define a vertical pod
autoscaler that can dynamically modify those requests and get me an emptier or
bigger node if needed. I can define a horizontal pod autoscaler to keep
aggregate CPU at a set target by auto-spawning more containers for me, and k8s
handles load balancing via DNS. Do typical production setups not require
some/most of these features? Mine do.
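
The vertical pod autoscaler is an add-on (a custom resource) rather than a
core API, but as a sketch it looks roughly like this (the target name is made
up, and it assumes the VPA add-on is installed):

```yaml
# Hypothetical VerticalPodAutoscaler: adjusts the Deployment's requests from observed usage.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # made-up Deployment name
  updatePolicy:
    updateMode: "Auto"   # VPA may evict pods to apply the new requests
```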

~~~
asdfaoeu
Except for the vertical scaling, all of that can be done with ASGs/ELBs.

------
random3
I dropped off midway through the post. It reads as a classic engineering
justification for X vs A, B, or C (the same justification can be made for
anything vs anything).

I found it interesting that for someone who doesn't use Kubernetes, they spend
a lot of time describing it.

There are benefits in using mainstream solutions like Kubernetes. I've spent
over a decade building distributed systems, from Hadoop to Mesos and
Kubernetes, and have seen the pains of the datacenter and of AWS, Azure, and
GCP, and all I can say is GKE works great, so I don't buy the
simpler-in-house argument. Once you start properly integrating concerns that
need to be end-to-end integrated, the complexity explodes and you end up with
a partially working system.

I do believe that serverless will likely make Kubernetes and similar-level
tech irrelevant for most users, however only as part of a rich ecosystem that
covers all the other concerns.

------
rudolph9
Having used Kubernetes extensively at my last position, adoption of k8s
strikes me as similar to the adoption of Linux as your desktop OS at the turn
of the century. Some people are doing amazing things with it! Others never
figure out how to get their WiFi to work consistently and are bitter that
people keep talking about it.

Eventually something akin to Ubuntu will grow up in the k8s ecosystem and
people will stop complaining that WiFi doesn't work.

~~~
pestaa
Interesting analogy!

I was very much hoping for OpenShift/OKD to be "the Ubuntu" of K8s, but their
targeted scope just keeps getting bigger (or it has always been big, I'm not
so sure anymore).

On the other end, K3s from Rancher is a minimalist distribution, which I think
will limit its adoption in sophisticated environments with larger teams.

Does any other K8s distro look like a good candidate to standardize on in the
future?

~~~
rudolph9
[https://jenkins-x.io/](https://jenkins-x.io/) has brought together a number
of open source projects in theoretically a good way, but it's still pretty
rough around the edges (at least last time I worked with it, in March) and
still falls short of an "Ubuntu" experience.

------
rudolph9
One thing people forget is that containers don't necessarily need to be
created with Docker.

Lately I’ve been creating containers using NixOS and couldn’t be more happy
with the ability to have everything in the container configured by a
configuration.nix file. [https://nixos.org/nixos/manual/#ch-
containers](https://nixos.org/nixos/manual/#ch-containers)

The idea of a Docker Ubuntu, Arch, Alpine, etc base image is kind of silly
when you think about it. The idea that we have separate repo for sharing
massive container images sometimes multiple GB rather that a simple file that
can be tracked in git and will produce the same image _every-time has been
very eye opening.

_ Note: NixOS isn’t perfect but it has a growing community and hands down the
best thing out there except for maybe guix which has container support too
[https://guix.gnu.org/blog/2018/tarballs-the-ultimate-
contain...](https://guix.gnu.org/blog/2018/tarballs-the-ultimate-container-
image-format/)

~~~
root_axis
How is a configuration.nix any different than a Dockerfile?

~~~
rudolph9
Dockerfiles are imperative, and the order of steps changes the container image
that ultimately gets built.

Nix packages are declarative and idempotent. Everything is based on a
functional package manager, and when you configure your container via Nix you
get the same thing every time and can easily mix and match other packages and
dependencies.

A contrived example: you have a program that depends on multiple versions of
Python. Because Nix does not set dependencies globally, but rather links them
locally, you can have both versions of Python without any conflicts.

Containers built from Dockerfiles are difficult to exactly reproduce too. A
very common practice is to update the package manager's index, which,
depending on what the server returns, will be different from one day to the
next, whereas Nix hash-addresses everything and builds it from source (you can
use caches to speed this up), so you get the same thing every time.

Note: NixOS isn't perfect in terms of being purely functional, as many package
builds use scripts in Perl, Bash, etc., but from my understanding these are
usually source-specific reproducibility issues with building particular
packages, and generally speaking things work as expected.

~~~
jmchuster
Then have your Dockerfile build from source. Many of the public layers that
people build their containers upon do that for that exact reason.

If NixOS is only "generally speaking" correct because of the order it runs its
scripts, then put it in a Docker container, where the whole point is to
perfectly reproduce the order you run your scripts in order to minimize the
errors that arise from doing otherwise. Still worried? That's why layers
exist, so that every step you run you now have a static binary that you know
will never shift under you, and that you can keep on building on top of.

~~~
smilliken
In practice, building Dockerfiles is unreliable and can produce errors because
they inherit all of the deficiencies of aptitude, python packaging, javascript
packaging, et al.

In practice, building nix dependencies is reliable and reproducible because of
the militant isolation of build environments, hash integrity checks on _all_
inputs, explicit dependencies, and the long tail of bug fixes related to non-
determinism in unix tooling.

This is my personal experience, not abstract.

~~~
jmchuster
That's a great vote of confidence. I would've thought that nix would only
manage packages at the OS level. How does it re-implement pip to be reliable?
I currently have to stage a couple of `apt-gets` and `pip installs` before my
`pip install requirements`, and it took a couple of tries to lock in the right
sequence that worked. How does nix solve those problems?

------
felipelemos
What I don't like in these kinds of posts is the fact that he seems to know
much more than he could.

"For example, most folks that run large-scale container orchestration
platforms cannot utilize their built-in secret or configuration management."

So, with thousands (maybe more?) of "folks" out there, each with a different
environment, requirements, background, and so on, how can someone know what
most of them can or cannot do? Sure, a lot of them post about it in blogs and
present at conferences, but that's still a very small subset of all the
companies running containers today in the world, and the majority will not
expose any info about it, for a myriad of reasons.

~~~
jmchuster
Probably by "most" he just means he's talked to 20 friends who head up devops
for such large-scale platforms, and 19 out of 20 explained how it didn't work
for them, and he then assumed that's generally applicable for the other couple
thousand out there.

------
ur-whale
Given Coinbase's reliability track record [1][2][3], I'm not entirely sure why
I'd listen to devops "wisdom" coming from their corner.

[1]
[https://www.reddit.com/r/CoinBase/comments/gh3b5t/coinbase_c...](https://www.reddit.com/r/CoinBase/comments/gh3b5t/coinbase_crashes_every_time_there_is_a_massive/)

[2]
[https://www.reddit.com/r/Bitcoin/comments/6gtwyi/every_singl...](https://www.reddit.com/r/Bitcoin/comments/6gtwyi/every_single_time_price_falls_coinbase_goes_down/)

[3]
[https://www.reddit.com/r/CryptoCurrency/comments/ggr80l/i_do...](https://www.reddit.com/r/CryptoCurrency/comments/ggr80l/i_dont_understand_how_a_platform_goes_down_every/)

~~~
modeless
While this may not be stated in the most diplomatic way, it's important and
relevant for this discussion and I don't think it should be downvoted. It's
very well established that Coinbase's infrastructure is unable to handle
traffic spikes, despite the fact that traffic spikes are a fact of life in
their industry and downtime can easily cost their customers millions of
dollars.

~~~
EdwardDiego
Huh, have they considered Kubernet... nevermind.

In seriousness, we find k8s makes handling our traffic spikes a doddle. As in,
we don't have to do anything.

Of course, we didn't necessarily need K8s to accomplish that, but it's the
tool we used, and the complexity cost is paying off for us.

------
darkwater
It puzzles me a bit that they don't want to invest in k8s expertise (at their
scale, and especially security-wise, it can be tough, I guess) but at the same
time they develop their own deploy system, which sounds a lot like Spinnaker,
and they have their own secrets/config management system (Hashicorp's stack is
pretty neat and battle-tested).

~~~
yahyaheee
Works great! ... till it doesn’t

~~~
threeseed
More likely that it works great until the lead developer leaves.

And then you're stuck with a proprietary solution that lags further and
further behind established, open source projects.

------
EdwardDiego
> The only way to sanely run Kubernetes is by giving teams/orgs their own
> clusters

Funny, we're running it sanely without doing that. We've separated our
clusters based on use-case - delivery vs. back-end, aiming towards the "cell-
based" architecture.

> Managed Kubernetes (EKS on AWS, GKE on Google) is very much in its infancy
> and doesn’t solve most of the challenges with owning/operating Kubernetes
> (if anything it makes them more difficult at this time)

...some details on the challenges they don't solve, or indeed make more
difficult would be good.

But yep, K8s is complex. So, to paraphrase `import this`, you only want to use
it when you have sufficiently complicated systems that the complexity is worth
it.

~~~
djsumdog
It will catch up with you. I was at one shop with an 11-person team dedicated
to platforms. The shop moved really fast, and even with 500 employees, they
were able to move from OpenStack to DC/OS in 2~3 months. (We had CoreOS
running on OpenStack, but fully migrated over to DC/OS. Jenkins -> GitLab also
happened very rapidly; really good engineers.)

At my current shop, we struggle to maintain k8s clusters with an 8-person
team. We inherited the debt of a previous team that had deployed k8s, and
their old legacy stuff was full of dependency rot. We have new clusters, and
we update them regularly, but it's taken nearly half a year so far and we
don't have everything moved over.

You do need good teams to move fast; and good leaders to prioritize minimizing
tech debt.

~~~
EdwardDiego
We've used a GitOps model (using Flux[1]) with a reviewer team made of people
across our dev teams (and the sysops, natch) to ensure that people aren't just
kubectl-ing or helm-installing random crap. We put about 2 weeks of effort
into getting RBAC right, so that everyone has read access to cluster
resources, but only a subset (generally 1 or 2 per team) have what we call
"k8ops" roles - and those are the same people reviewing pull requests in the
Flux repo - and the norm is to use the read-only role as default. The only
time I've recently had to use my k8ops role was to manually scale an
experimental app that was spamming the logs down to 0 replicas so the devs
responsible could sort it in the morning.
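
As a rough sketch of that read-only default (the role, group, and resource
lists here are illustrative, not our exact config):

```yaml
# Hypothetical read-only ClusterRole: view common resources, no write verbs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-reader            # made-up name
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "pods/log", "services", "configmaps", "deployments", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]
---
# Bind it to a group covering all developers (group name is a placeholder).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-reader-binding
subjects:
  - kind: Group
    name: developers
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-reader
  apiGroup: rbac.authorization.k8s.io
```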

I think the way we've approached it achieves the same goal as just giving each
team their own cluster to avoid them messing up other teams.

[1]: [https://docs.fluxcd.io/en/1.19.0/](https://docs.fluxcd.io/en/1.19.0/)

------
maxdo
1) Those guys can't imagine their life outside of Amazon. This is bad. And
yes, Amazon's k8s was bad; it's getting better, but it's still bad.

2) They said they would need a separate team, and instead they wrote and
maintain their own sort-of-lightweight solution. Nothing more to that article.
One opinionated team decided to roll out something homemade to add it to their
CVs later on.

~~~
techntoke
Something that won't even help their resumes in the long run.

------
Karupan
I partially blame keyword driven recruitment in tech for these kinds of
responses to a platform/tooling. Kubernetes isn't a magic bullet - it is a
platform which solves a very specific set of problems with scaling. And of
course it doesn't come for free. You can't just _throw in_ k8s into your
existing infrastructure and expect your devs to manage it, in addition to
their regular work.

And yet, we keep reading about teams falling into the trap because their lead
engineer wanted to put "production kubernetes" on his resume. I hope the k8s
team adds a huge "Who is this NOT for" disclaimer to their docs (if it doesn't
already exist).

~~~
systematical
Oh God. This is what is happening at my work. We have an API that has 200
write users and a public front-end that can do reads. None of it is heavy
though, with most writes occurring for a month in the winter and a month in
the fall. In the unlikely event of heavy write loads we could just scale up
CPU/RAM for those two months. Any read load can be solved with cache or by
spending time on the worst offenders in SQL. The Lead Dev is gung-ho that it
MUST be a micro-service with K8, Kafka, and I'm sure a bunch of other shit we
don't need, for what is the same application that has been written for half a
century: data in/out with business logic applied. The entire API has about 8
paths, with your basic HTTP methods for each. It's probably the smallest API
I've ever worked on.

The positive for me is I am learning a bunch of stuff during the process. The
negative for the project is he expects developers with no previous skillset in
this space to design all this new (for us) tech without introducing technical
debt. The downside for the client is...well guess.

It's sad. I am rewriting a problematic legacy application and creating a whole
new one. I guess I can put K8 and the kitchen sink on my resume when I look
for a new job just before this thing implodes on release though. I just wanted
to write clean code man, that's all. There is a sadistic part of me that wants
to grab some popcorn and see how this explodes in his face... there is also a
part of me that wants to be proved wrong... I hate this job.

~~~
harpratap
So are you assuming that your service will never grow? Because vertical
scaling is nice and easy as long as you know the limits, but once you cross a
threshold, no matter how many CPUs you throw at a problem it just won't scale.
Your senior lead seems to be anticipating that and preparing in advance.

~~~
systematical
Who assumes a system will never grow? No one. I'm looking at the 10-year-old
legacy system we are replacing. That tells me a lot: where it's been, why it's
that way, and a lot about where it's likely to go in the next 2-3 years. Its
function is actually being reduced in this rewrite.

I'm not against the microservice idea. I'd just rather focus on solving the
problems the legacy system had. None of those had to do with scaling. They
were related to a certain federal agency changing their mind every 2-4 years
and really poor coding practices. A microservice with Kafka and k8 doesn't
really solve those issues.

------
vegai_
Nomad has been brought up in a lot of comments, and it has a feature that
nobody has mentioned yet, I think: it's multi-platform. It currently has
official task drivers for Docker, isolated/raw fork/exec, Java, and QEMU. It
has several community-based task drivers, including Windows IIS and FreeBSD
jails.

Kubernetes mostly supports Linux, although it has recently gained Windows node
support.

~~~
TurboHaskal
I am extremely happy with Nomad, it is one of the best decisions we have ever
made in our organisation. We had a thin abstraction layer over Consul + Docker
Swarm that we could port to Nomad in a matter of a few hours and it's been
rock solid so far.

But sometimes I wonder whether we are hurting our own careers not going down
the k8s route. It really seems it has a bright future.

~~~
vegai_
> But sometimes I wonder whether we are hurting our own careers not going down
> the k8s route. It really seems it has a bright future.

I believe the world will always have a demand for simple technology.

~~~
TurboHaskal
True that, but I may be in the job market soon, and all I see is Kubernetes
and a plethora of tools related to its ecosystem I have no exposure to.

------
zapf
My personal mantra of being a late adopter when it comes to cloud deployment
tools has its benefits and has served me well. I am even reluctant to
integrate Docker into my workflow. Git pull, build, and deploy bash scripts
are serving me well enough for now. Thank you very much.

------
theptip
One note on the security side of things -- if you're interested in seeing what
a truly hardened k8s/GKE configuration looks like, check out the Vault
examples:

[https://learn.hashicorp.com/vault/kubernetes/k8s-reference-a...](https://learn.hashicorp.com/vault/kubernetes/k8s-reference-architecture)

[https://github.com/sethvargo/vault-on-gke](https://github.com/sethvargo/vault-on-gke)

In summary, for your security-critical workloads you're going to want to put
them in their own cluster; treat k8s in this case as an API for updating the
code that's running on your VMs. (Except your VMs can run a stripped-down
read-only OS like Container-OS or CoreOS).

------
rossmohax
The pinnacle of devops effort to deliver apps before Kubernetes was AWS
Beanstalk, which their setup replicates in great detail.

I'd take k8s over Beanstalk any time.

Kubernetes is not about scale; it is about defining primitives everybody can
use to describe their setup. It is a DSL everyone converges to, and as a
result of this unification, products from different vendors can be packaged
and deployed in a uniform way.

------
richardARPANET
The CTO at a client of mine has spent over 1 year trying to deploy ~7 web
services to k8s, hired 5 contractors to help and they still don't have a
stable deployment.

I informed them my team could do it all in under two days using Terraform +
AWS + EBS. Unfortunately, they didn't take us up on the offer.

Sunk cost, etc.

I've never seen a company use k8s and not end up with a giant expensive team
maintaining it all.

------
maps7
My company is about 10 years behind so I'm not using K8s yet. Hopefully all
the issues will be sorted by the time I get to it.

~~~
Mandatum
I honestly think treating every company except R&D or startups like this makes
sense. I've been playing firefighter at F&I orgs who've stood up k8s recently,
and it's insane the amount of debt they've taken on in such a small amount of
time. The infra spend is orders of magnitude greater than it would be if they
had physical boxes, because getting approval for physical boxes is hard. But
for some reason budget on AWS, GCP, or Azure is easy.

------
eddieoz
The discussion and pressure always start on the engineers' side, but normally
they don't consider that building a system from scratch with a technology is
different from migrating to it.

The article is clear: Coinbase has a very strong, tested, and validated
infrastructure, and moving to Kubernetes 'just to be part of the hype' does
not bring any benefits at this point. And a statement like that is a nightmare
to some DevOps folks.

------
pstrateman
This isn't really about the article, but why does it take 6 seconds to load
the page?

~~~
SilasX
\- Hypertext blog post with some pictures that takes six seconds to load.

\- Exchange that crashes during any time of high volume.

"Here is why, in our superior engineering judgment, you should do things the
way we do in order to preserve uptime."

(To answer your question: probably because it uses Medium's code.)

------
markbnj
Many of the author's issues with kubernetes are specific comparisons with
their current workflow. One I can definitely agree with is the burden of
upgrades and keeping the platform current. There are techniques which can make
that easier to handle. One statement that surprised me:

>> at Google it isn’t uncommon for them to have multi-hour outages with GKE

For what it's worth we've been running multiple GKE clusters in production for
over three years. We're medium size, with some dozen or so in-house services
handling perhaps a total of 20k rps. We are rarely affected by any GCP issue,
and as far as I can recall we have never been down due to a GKE-specific
problem. In addition to the basic orchestration features we make significant
use of ingress and storage primitives. It all quite literally just works.

------
calt
Their system seems very sane. I'm jealous, and I hope it stays that way (sane,
not static). It's also entirely dependent on features of AWS. I used to work
for a company that had a very insane hybrid datacenter/AWS deployment
environment, and containers provided some sanity.

Service discovery, cluster management, secret storage, these are all problems
that we _already_ had. Containers (for us on mesos) just solved some parts of
that picture.

I lived and breathed containers, distributed systems, config management,
deployment pipelines etc. for years, and I forget that to many K8s is just
seen as one magic bullet solution. You will have to pick it apart and interact
with pieces of it if you really want to use it at a medium sized company. That
takes a lot of research and understanding.

------
kelseyhightower
Coinbase built and maintains their own platform that's working for them.

Coinbase provided an analysis worth studying. The major takeaway for me:
asking people to manage their own Kubernetes cluster is like asking people to
manage their own hypervisors when they just want VMs.

------
daitangio
Question: couldn't some of the security/management concerns be addressed via
AWS Fargate?

The bottom line is: K8s is awesome but it is complex. It is
Google-needs-oriented software. Sadly, the average project will not have the
complexity you find at the Googleplex.

I can run an Oracle database in 100MB of RAM... you need at least 4GB to run a
K8s node - a costly option. Below ten servers I find no point in K8s: I highly
suggest Docker Swarm.

------
caiobegotti
So the gist to me seems to be that they have such vendor lock-in with AWS that
they don't need to bother with Kubernetes today. Fair enough to me.

------
Fizzadar
I'm a big fan of this - to me containers are another hype train like cloud and
serverless. There absolutely are good use-cases (we run a number of K8s
clusters with great success at work) but it's absolutely not the solution to
all (app deployment) problems. As said in the article, I think complex systems
like K8s can often have a detrimental effect on productivity. KISS.

------
d0m
Quick shout-out to Aptible (Heroku for HIPAA), which has been amazing for our
healthcare startup.

Still, I honestly feel that we've taken a step back in the industry where we
went from ec2 --> heroku --> k8s. But I know there are many people working
hard to create the next infrastructure so we don't have to deal with
containers and all that re-inventing-devops nonsense for every project.

~~~
chillfox
Heroku is missing the ability to get a dedicated IP for a project.

~~~
httpz
You just have to pay more to get private spaces on Heroku.

~~~
snuxoll
Pay more is the Salesforce motto.

------
ulzeraj
Coinbase's service is known to melt every time there is higher volume.

------
thedevopsguy
I find it interesting that many people seem to conflate the complexity of
managing infrastructure and services with K8s.

K8s is complex because managing distributed services is. Not using it doesn't
mean it goes away. The complexity migrates and ends up being bundled up in a
separate tool or a runbook process or some script.

It's hard to maintain because the tools and apis are different from what some
engineering teams are accustomed to using. Building an in-house tool gives
them a warm fuzzy feeling and comfort that they can handle problems when they
appear due to familiarity with their own code and design choices.

It's a fair trade off. I do wonder how much of the time spent doing this
exercise could have been spent on K8S training.

I do feel that the K8S community does downplay how much of a PITA k8s
configuration can be, and that the perceived robustness of cloud-managed K8S
isn't up to scratch for something this complex.

------
vbernat
The history is a bit fuzzy. The interesting features introduced in 2.6.24 were
PID and network namespaces. Containers were "complete" by Linux 3.8 with user
namespaces. Cgroups are not that important for being able to build containers
(isolation comes first). There were other out-of-tree technologies before
that, notably VServer and OpenVZ.

~~~
davidpaulyoung
Where are Solaris containers in the "history"?

------
gigatexal
Imo: the best way to run K8s is via some managed provider. Sure, you can do it
the hard way a la Kelsey Hightower, but in production, and if you're not
staffed with k8s experts, I'd rather give that to a provider and focus on what
I know — the code and biz logic.

------
vandalatyou
It's all about scale. Large scale requires an orchestrated application
management system with progressively integrated bells and whistles and minimal
fiddling with userspace. Small scale doesn't factor in. Do small scale the old
way if you want: binary artifacts, AMIs, home-grown integrations, and
playbooks/scripts. Having done it both ways, I continue to do it both ways.
Run a monolith the old way, and run your web-tier, stateless, customer-facing,
highly available services in k8s. If it's not broken don't fix it (but for
some use cases (large scale) the old way _was_ broken).

------
nwmcsween
I'm currently going through this, but the alternative is just too painful.
I've tried Nomad, Deb packages, etc., but the tooling around all of this is
basically build-your-own, versus Kubernetes, which has tooling for a lot of
things.

------
stock_toaster
Here[1] is a better link to the "jails > FreeBSD" paper (aka, "Jails:
Confining the omnipotent root").

I believe it initially shipped as part of FreeBSD 4.0 in March 2000.

Wikipedia says[2] it was in use in the late 90s in a hosting provider before
inclusion in FreeBSD though. Interesting! TIL.

[1]: [https://papers.freebsd.org/2000/phk-jails.files/sane2000-jai...](https://papers.freebsd.org/2000/phk-jails.files/sane2000-jail.pdf)

[2]:
[https://en.wikipedia.org/wiki/FreeBSD_jail](https://en.wikipedia.org/wiki/FreeBSD_jail)

------
mickeyben
Thanks for the article. I've been in tech for the past 10 years, working in or
around devops teams for the most part, but I don't get all the fuss about k8;
yes, it's an amazing tool doing a lot more than any other.

But it has a big learning curve, and setup & maintenance are very costly. I
don't understand why most orgs are moving to k8 considering this. When talking
to my peers I often see numbers like 6 months to 2 years for a full migration,
with very little added value - at least for 90%+ of the companies using it.

It usually boils down to attracting talent and keeping them excited by trying
the new shit.

~~~
pm90
There is a learning curve either way. It's either an in-house container
orchestration platform or one of the open source ones. Coinbase seems to have
chosen the former.

K8s is simply a good default environment which provides rock-solid stability
for your applications by outsourcing the distributed-systems complexity to
your infrastructure team (whether it's internal to your company or a managed
one like GKE). Teams are not using it just because it's "cool" (maybe some
are); there is no need to develop in-house strategies to deploy an app, keep
it running, and scale it (among other things; this is the lowest common
denominator).

It’s the same reason why big data tech has somewhat standardized on a set of
tech (spark, airflow etc): once people learn the system, they can focus on
building products that provide value rather than building the products and the
relevant infrastructure.

------
dpenguin
Can fully relate to the author. We have been struggling with effective k8s at
our company for 2 years now. There is too much to learn to get your first
service into production. You end up writing wrapper after wrapper that makes
you think "we can't be the first guys to solve this", and after googling, you
find that every issue you hit is described as "just" one of the downsides of
using k8s, and here's how we overcame it.

I wish someone would make a deployment orchestrator for your private DC that
is as simple to use as Heroku is. Edit: fixed typo.

~~~
erulabs
> I wish someone would make a deployment orchestrator for your private DC that
> is as simple to use as Heroku is.

This is what I'm working on, although it is built on top of Kubernetes! We
feel Kube suffers from a disconnect between the insane complexity requirements
of enterprise deployments and what most coders and businesses actually need -
which is exactly what you said: a consistent Heroku-ish experience on their
own hardware or whatever cloud is convenient. Kube needs what Git needed - a
GitHub. Feel free to reach out to me (email in profile), I'd be happy to chat
your ear off about it!

~~~
harpratap
I don't think there is a disconnect; the K8s team knows about it, but this is
not a problem meant to be solved by K8s alone. Projects like Knative, OpenDeis,
Fargate, and Cloud Functions are supposed to provide a Heroku-like PaaS on top
of Kubernetes.

------
rb808
We haven't even gotten to the ops stage yet. A small team has spent 6 months
trying to repackage our app into OpenShift and it still barely works. I think
everyone regrets even looking at it.

~~~
freedomben
What sort of challenges have you run into? What are some of the things the
"small team" have done for the last 6 months?

~~~
rb808
Networking mostly - RMI, JMX. It's our internal rules too: getting egress to
work on our segregated networks.

------
ilaksh
My feeling is that I just want AWS Lambda functions to allow for larger
packages. And then clusters and scaling and OS updates etc. are Amazon's
problem.

I would rather just be able to do that than get into ECS/Fargate, much less
K8s. It seems like all of that stuff is just adding more complexity for me.

Of course none of my projects are gigantic or need to be highly secure.

------
tannhaeuser
I wonder if Mesos, with its somewhat more tidied-up infrastructure and
codebase and fewer moving parts, could make it as a k8s replacement, if you
absolutely must run Linux container workloads (though I understand Mesos can
run anything, not just Docker-/runc-like images).

------
egberts1
This is why I ditched systemd et al. and went with barebones embedded Linux on
QEMU, then built it up using a Makefile. I probably could have used Yast, but
didn't want the huge pull-in of libraries.

BUT, my prototyping turnaround is way faster and more stable than K8s. Way
less staffing too.

------
pjmlp
The history timeline misses the container technologies across commercial
UNIXes like HP-UX, Tru64, and Solaris, and mainframes.

However, the main point still stands: k8s is overkill for something like 90%
of the deployments using it.

------
rexarex
‘We couldn’t figure out k8s, so read our blog post on why it sucks’

~~~
maigret
That's really oversimplifying. EVERY tech has a learning cost, and so a return
on investment. K8s has been so hyped that many developers think it's a
mandatory skill. It's not. I've seen talented app developers fail to
successfully deploy an app on K8s. For many, Heroku or Cloud Foundry is good
enough.

Even in their case, it looks like they made a conscious decision and
documented it. I wish more teams would do that. You are free to use K8, and
I'm sure right now it's still not a controversial choice that will run into
walls.

------
nijave
imo running your own k8s in the cloud makes no sense. If you've already set up
on prem servers, load balancers, config management, patching, access control,
etc. to allow developers to run applications on VMs, k8s can provide an
integrated experience with significantly less work. If you're running k8s in
the cloud, then just use a hosted service and leverage enterprise support.

In the on prem case, you already have a dedicated infra team that probably has
the tools to effectively deploy and manage a cluster.

~~~
CameronNemo
> In the on prem case, you already have a dedicated infra team that probably
> has the tools to effectively deploy and manage a cluster.

Tools? Yes. Care and attention? Hell no.

The ops team at my work still has important services running on CentOS 6, and
they are spending all their time trying to get Kubernetes configured and
working.

You would be surprised at how much infrastructure out there is just glued
together ad hoc. Several senior engineers at my work are successfully blocking
adoption of GitOps and CD.

------
mateioprea
You forgot to push the Gemfile
[https://github.com/coinbase/odin](https://github.com/coinbase/odin) here :)

------
mantenpanther
I really like the simplicity of Swarm and continue to use it.

------
namelosw
Kubernetes is great, but most projects don't really need it.

Better to go with Heroku equivalents than to feed a whole team, at least until
you really need a whole team to feed Kubernetes.

~~~
p_l
Great if the pricing works out for you.

Unfortunately, in most cases if I used Heroku I'd drive the cost way too high,
as Heroku (or cloud) pricing doesn't change with local developer costs.

~~~
namelosw
That's why I said 'equivalents'.

There are also application engines in the major cloud providers, based on your
technology. Some tech stacks can run serverless. For self-hosted scenarios,
things like Dokku could also work.

~~~
p_l
It's not a matter of _which_ product you use, more that "setup on unmanaged"
often implies savings huge enough that you can pay for more engineers.

'Cause in many places the cost of a fully loaded experienced engineer might be
_cheaper_ than managed services.

------
matthewcford
Another good alternative is Convox.

------
modeless
tl;dr We are all in on AWS proprietary stuff instead.

~~~
adrianscott
Yes, except more like 'in addition to' than instead... unbelievable... #lockin

~~~
calt
To me it looks like they saved quite a lot of engineering effort, and the
price was lockin. Seems like it was probably a fair trade for them.

At least if it's sane and well thought out it can be taken back apart and
repurposed for something different. There will be cost. Consider it paying
back the loan. That's why it's called technical debt. The terms of the debt
look good to me.

------
jb_gericke
So the tldr is: we thought k8 was too complicated, so we hand-rolled our own
orchestration tool, which is way better...?

------
svntid
they refrain from adopting kube because they claim they don't have anyone on
the team who is able to pull that off - facepalm

------
punnerud
K8S is to infrastructure, as SQL is to data?

~~~
quickthrower2
As mongodb is to data

------
jwildeboer
TL;DR “we’re completely dependent on AWS and have given in to vendor lock-in”
;)

------
danenania
"For example, most folks that run large-scale container orchestration
platforms cannot utilize their built-in secret or configuration management.
These primitives are generally not meant, designed, or built for hundreds of
engineers on tens of teams and generally do not include the necessary controls
to be able to sanely manage, own, and operate their applications. It is
extremely common for folks to separate their secret and config. management to
a system that has stronger guarantees and controls (not to mention scaling)."

FYI to anyone who is reaching (or close to reaching) this point: this is
exactly the problem we're trying to solve with EnvKey[1] in a secure, robust,
and scalable way.

Our current product runs on the cloud (using zero-trust end-to-end encryption)
and solves the problem quite well imho for companies with up to 50 or so
engineers and moderately complex infrastructure. It fits easily into either
containerized or non-containerized stacks.

But our v2 is almost done after years of work, and I think it will now be able
to handle almost any scale and workload. The launch target is August 1st. Some
cool features it will offer:

\- Source available self-hosting with auto-scaling, HA, and strong consistency
that just works.

\- "Config blocks" that can be used in multiple projects, allowing you to de-
duplicate configuration and secrets.

\- Version control with simple and advanced rollback capability.

\- Comprehensive access logs with simple and advanced filtering and auditing
capabilities.

\- Ability to manage local environments.

\- Ability to react to updates and e.g. restart servers when config values
change.

\- A CLI that will have full parity with the UI, either for automation or
those who prefer a CLI-based workflow.

\- An option to use our UI via an incognito web browser pointed at localhost
instead of Electron (for those who hate Electron)

\- Faster, lighter, modernized end-to-end encryption built on NaCl (v1 uses
OpenPGP, which is great, but it's time to move on).

\- Device based auth with optional passphrases (think SSH) and an easy, secure
workflow for granting access to new devices.

\- Ability to authenticate and invite users via Github, Gitlab, Google, Okta,
or SAML (including inviting from multiple sources within a single org).

\- Teams/groups and advanced group management: grant teams of users access to
groups of apps, connect groups of blocks to groups of apps, etc.

\- Customizable everything: environments, sub-environments, and access roles
can all be molded to fit your workflow.

Our goal is to fully solve this crucial piece of the stack so that it "just
works" with minimal time spent on integration (for new projects it can be
installed and integrated in minutes). If you're interested (and/or want early
access), submit your email to the form at the bottom of our site:
[https://www.envkey.com](https://www.envkey.com) -- and of course if the v1
can solve your configuration and secrets management needs as it already does
for hundreds of our customers, give that a shot! Upgrading from v1 to v2 will
be very quick so you won't be duplicating any work.

Also, we're hiring--remote anywhere in the US. We'll put up a jobs page with
more detail soon, but our stack is TypeScript (node + react), Go, and polyglot
(since we need to write lots of integrations). Shoot me an email if this
sounds like a system you'd be excited to work on: dane@envkey.com

~~~
sk0g
Unfortunately I live in not-US (Aus), but I have to say, envkey looks very
good! I'm actually floating this and HashiCorp Vault by management at the
moment, since even with 4 devs, keeping environment variables in sync is a
PAIN!

If you don't mind me asking, how does the Go integration work? I initially
thought it was actually some sort of alternate `os.Getenv()` that you
imported, but that doesn't seem to be the case. And what would the latency be,
for changes in environment variables being synced to running deployments?

~~~
danenania
The Go library (along with our other language-specific libs) is a thin wrapper
around envkey-fetch --
[https://github.com/envkey/envkey-fetch](https://github.com/envkey/envkey-fetch)

When a process starts, envkeygo makes a request to our config service (via
envkey-fetch) to fetch the encrypted config, then decrypts the config and sets
the values on the environment so they can be retrieved with `os.Getenv`. Both
the lookup id (for fetching the encrypted config) and the encryption
passphrase for decryption are initially passed in via an ENVKEY=...
environment variable -- you can think of it as a single environment variable
that "expands" into all the others that you need.

Latency on a request is generally in the 150-300ms range from the US (our
primary servers are in us-east-1).

EnvKey is strongly consistent and transactional, so once you make a change to
your config, it will be available immediately for any subsequent requests. For
now, it's still up to you to restart servers/services yourself after a change.
With the v2, this will be scriptable.

We'll eventually hire outside the US too, but for now I'm trying to keep the
timezone spread and administrative burden low :)

------
encoderer
At this point I’m of the belief that adopting k8s at most companies should
result in shareholder/investor lawsuits for the near-criminal waste of
corporate resources.

~~~
banifo
I'm not sure why.

My personal experience is very good and I'm looking forward to a bright
future.

