
Our nightmare on Amazon ECS - maslam
http://www.appuri.com/blog/-our-docker-nightmare-on-amazon-ecs/
======
tjholowaychuk
I do think they need to put more effort into CLIs etc., instead of relying on
OSS to fill this niche, or at the very least put more effort into supporting OSS.

Lambda is similar: we have 'Serverless', and I'm hacking on Apex
([https://github.com/apex/apex](https://github.com/apex/apex)) just to make it
usable. I get that they want to create building blocks, but at the end of the
day consumers just want something that works. You can still have building
blocks AND provide real, workable solutions.

I was part of the team migrating Segment's infra to ECS, and for us at least
it went pretty well; some issues with agents disconnecting etc. I sort of
wrote off, since ECS was so new at the time.

Another annoying thing not mentioned in the article is that the default AMI
used for ECS is not _at all_ production-ready; you really have to bake your
own images if you want something usable. I suppose this is maybe because
there are, subjectively, no "good" defaults, I'm not sure, but it's a bit of a
pain.

ELB for service discovery is fine if you can afford it; I had no issues with
that, and ELB + DNS keeps things very simple. I'm not a huge fan of all these
complex discovery mechanisms; in most cases I think they're completely
unnecessary, unless you're just looking to complicate your infrastructure.

I also think in many cases _not_ propagating global config (env) changes is a
good thing, depending on your taste. Scoping to the container gives you nice
isolation and more flexibility if you need to redirect a few containers to a
new database, for example. You don't have to ask yourself "shit, which
containers use this?". It's much like using flags in the CLI: if we _all_ used
environment variables in place of every flag, it would be a complete mess.

EDIT: I forgot to mention that the ELB zero-downtime stuff was awesome; if you
try to re-invent that with haproxy etc., then... that's unfortunate, haha. No
one should have to implement such a critical thing.

~~~
nathanboktae
> I also think in many cases not propagating global config (env) changes is a
> good thing. ... You don't have to ask yourself "shit, which containers use
> this?"

I agree, and we (dev lead at Appuri here) get the best of both worlds from
Kube: in the secrets section of a deployment definition, we specify which
secrets we need, but not their values. So we know which services need a given
secret, and it's updated in one place. That's just for the secret store,
though we could put non-secrets in a secret to use the same mechanism.
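
A rough sketch of what that looks like (all names here made up):

    kubectl create -f - <<'EOF'
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: api
    spec:
      replicas: 2
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
          - name: api
            image: example/api:1.0
            env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials   # secret is defined/updated in one place
                  key: password          # deployment only references it by name/key
    EOF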

~~~
cpitman
Have you looked at ConfigMaps? They're a newer feature of Kubernetes that is
meant for storing non-secrets, but in general works pretty similarly to how
secrets work (create config map, mount in container).
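
For reference, the basic flow is just (file and map names made up):

    # create a config map from a local file...
    kubectl create configmap app-config --from-file=app.properties
    # ...then reference it from the pod spec, either as env vars or as a
    # volume of type configMap mounted into the container
    kubectl describe configmap app-config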

~~~
velkyk
Sure, we use ConfigMaps, secrets, EBS mounts, you name it. Implementing k8s
felt like Jack in Titanic getting into first class :). Nice to know they won't
lock you up when the boat is about to sink :).

ConfigMaps are nice, but we use them in a limited way because it's so much
easier to update pods by editing env vars. Note: we are using deployments, so
if you need to change an env var, you do `kubectl edit deployment <name>`,
edit/save/close the file that opens in your $EDITOR, and watch the magic
happen.

------
dperfect
> ECS doesn't have a way to pass configuration to services

I believe this is the recommended way:

ECS container instances automatically get assigned an IAM role[1], with
credentials accessible via instance metadata (169.254.169.254) [2]. Containers
can access that metadata too. The AWS SDK automatically checks that metadata
and configures itself with those credentials, so all you have to do is give
your IAM role access to a private S3 bucket with configuration data and load
that configuration when booting up your app.
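
A minimal sketch of that boot-time step (bucket and file names made up; the
AWS CLI, like the SDK, picks up the role credentials from the metadata
endpoint automatically):

    #!/bin/sh
    # container entrypoint: the instance's IAM role only needs s3:GetObject
    # on the config bucket; no keys are baked in or passed via ENV
    aws s3 cp s3://my-config-bucket/myapp/production.yml /etc/myapp/config.yml
    exec myapp --config /etc/myapp/config.yml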

That way there's no need to copy/paste variables, and no leaking secrets in
ENV variables. You _do_ have to be careful though (as with any EC2 instance)
not to allow outside access to that instance metadata endpoint, e.g., in a
service that proxies requests to user-defined hosts on the network (but if
you're doing that, you've got a lot more to worry about anyway).

[1]
[http://docs.aws.amazon.com/AmazonECS/latest/developerguide/i...](http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html)

[2] [http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-
roles...](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-
amazon-ec2.html)

~~~
embiggen
One reason I am hesitant to go this route is that I don't want to hard-code
Amazon's APIs into my apps...

~~~
dperfect
I understand the reluctance to add extra dependencies (especially environment-
specific ones), but in the case of a typical Ruby app, it amounts to the 'aws-
sdk' gem and 1 or 2 lines in an initializer.

For my own purposes, I weighed that against the alternatives[1], and it seems
like a fairly reasonable compromise[2]. That won't be the case for everyone,
obviously.

[1] [http://elasticcompute.io/2016/01/21/runtime-secrets-with-
doc...](http://elasticcompute.io/2016/01/21/runtime-secrets-with-docker-
containers/)

[2] I'm referring specifically to passing _secrets_ (or other static values)
into a container, since that seems to be what the author was talking about.
For configuration requiring more complexity, of course other tools are
probably more appropriate. In that case, it's outside the scope of what I
would reasonably expect ECS to do.

------
nzoschke
Thanks for the shoutout to Convox! I'm on the core team.

I understand these challenges. I wrote about a lot of them here:

[https://convox.com/blog/ecs-challenges/](https://convox.com/blog/ecs-
challenges/)

But we have been having tons of success on ECS both for our own stuff and for
hundreds of users.

I see the agent disconnection problem too. Convox automatically marks those
instances as unhealthy, and the ASG replaces them.

It's happening more often than I'd like, but I'm seeing little to no service
disruption. One of the root causes is the Docker daemon hanging.

Glad Kubernetes is working well for you. Many roads lead to success as the
cloud matures.

~~~
maslam
That's a great blog post. Thanks for sharing!

------
cddotdotslash
What I don't understand is why AWS squandered this opportunity. Given the
popularity of Lambda, they clearly saw the market for completely managed
services. They could have designed a platform where users upload containers
and AWS runs them: no servers, no crazy settings, etc. Instead, they created
a platform where you still have to run the entire EC2 infrastructure
yourself, there is no service discovery, etc. They essentially created a
half-baked Mesos or Kubernetes clone. I'm still shocked when I hear of
companies going "all in" on ECS.

~~~
codemac
A good friend told me that he felt that Google Cloud and AWS had "a severe
lack of imagination".

Imagine what you could do if you didn't even assume a process model. All app
state just resident in memory, but magically persisted. Who needs object
storage? Re-invent the pointer!

We could have lived in the future, now it seems we're permanently wed to the
past.

~~~
nikanj
I heartily recommend this essay
[http://scholar.harvard.edu/files/mickens/files/thenightwatch...](http://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf)
, for gems like "Pointers are real. They’re what the hardware understands.
Somebody has to deal with them. You can’t just place a LISP book on top of an
x86 chip and hope that the hardware learns about lambda calculus by osmosis."

~~~
codemac
I love James Mickens with all my heart.

All of it.

------
cxmcc
Our experience with ECS (at Instacart) is not the best, but we managed to get
it to work.

Here is how we get around the issues mentioned in the article:

* Service discovery: built our own with RabbitMQ (we were using it before ECS anyway).

* Configs: pass an S3 tarball URL as an environment variable and download it in the container (see the sketch after this list).

* CLI: built our own with the help of CloudFormation.

* Agent disconnects: we did not see a situation where all agents disconnected. We use a large pool of instances, so starting containers was never blocked by agent issues.
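
For the configs item above, the container entrypoint amounts to something like
this (variable and path names made up):

    #!/bin/sh
    # CONFIG_URL is set per-service in the ECS task definition
    aws s3 cp "$CONFIG_URL" /tmp/config.tar.gz
    tar -xzf /tmp/config.tar.gz -C /etc/app
    exec app-server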

In addition to these, we also do the following to make ECS work as we want it
to:

* Built our own blue-green deploy solution (the structure provided by ECS is very limited).

* Built our own solution to integrate with ELB (an ELB allows only one port per load balancer).

------
cyberferret
Excuse me while I pick myself up off the floor after reading "leaks
environment variables"... What?? That is incredibly scary for us, as we just
went through an audit of our code on about 6 different web apps to ensure
that all secrets were placed in environment variables in our Elastic
Beanstalk configs and not in the main codebase... If this now results in LESS
security than before (all our code is in private Git repositories), then we
have essentially taken a step backwards!

~~~
dorfsmay
GitHub private repos have been made public by mistake before. Git repos are
cloned onto dev laptops; do you enforce laptop encryption?

The right thing to do is to use some form of vault.

~~~
cyberferret
We use Bitbucket here rather than GitHub - similar risks, I know, but we have
predetermined repositories which are all set as private, and 3 dev machines
which are kept on premises at all times.

Still not optimal as far as security goes, but it seems that we have roughly
the same exposure if AWS leaks our keys and passwords to other third
parties...

~~~
fletchowns
Be careful when modifying user access to a private Bitbucket repository. Their
autosuggest for the username input field will show _all_ Bitbucket users.
That makes it incredibly easy to accidentally grant somebody outside of your
organization access to a repository.

~~~
iampims
I can attest to that. I have the username "tim" on bitbucket and get added to
private repos all the time by mistake.

~~~
fletchowns
On top of that, there's no audit history of who has had access to a
repository. Absolutely ridiculous.

------
jbaviat
We have been running Sqreen's production on ECS since October 2015, and we
have been pretty happy with it the whole time. Of course ECS was very minimal
at the beginning, but then a lot of stuff improved, allowing for easier
deploys, easier logging, and finally easier auto-scaling. When ECR (the AWS
managed registry) was added to our region, it was quite a party @Sqreen :) I
see no point in leaving it for something else today.

A remaining issue is that you cannot spawn two containers speaking to a given
ELB (AWS load balancer) on the same host if they need to bind the same port.

------
hosh
I put something into production with ECS as well. I ran into the same missing
components too -- lack of service discovery, and such. Kubernetes works a lot
better. As it stands right now, I wouldn't take a gig involving putting ECS
into production.

Now, if ECS 2.0 were really AWS-hosted Kubernetes, I would be very interested
in hearing about that...

~~~
tantalic
Google Container Engine (GKE) is certainly the easiest way to set up a
Kubernetes cluster. We have been running it for a couple of months now and
couldn't be happier. If you want to stick with AWS, I have always heard
great things about the work CoreOS has been doing in this space:
[https://github.com/coreos/coreos-
kubernetes](https://github.com/coreos/coreos-kubernetes).

~~~
alex-mohr
It's great to hear GKE is meeting your needs so well! (Yes, I work on it.)

For 1.4, the Kubernetes Cluster Lifecycle and Ops SIGs are working on making
the install and setup process much easier, including on AWS [1]. That won't
magically turn it into Kubernetes as a Service, of course, but we hope it'll
help users on other platforms.

[1]: [https://github.com/kubernetes/community/blob/master/sig-
clus...](https://github.com/kubernetes/community/blob/master/sig-cluster-
lifecycle/README.md)

------
Ixiaus
You probably want to use a secret management tool and just not load secrets
into environment variables...

[https://github.com/fugue/credstash](https://github.com/fugue/credstash)

------
rjurney
Running DCOS (data center operating system) on AWS is a snap, and solves all
these problems. It makes running docker images a no-brainer compared to all
other solutions, and this includes docker images that interact with one
another (not just 100 apache servers or whatever). It is the best software I
have ever used, hands down. It is the second coming of zeus buddha jesus
belly. It makes scaling anything in the cloud easy. No, I do not work there.
No, I am not exaggerating. Yes, I spent a month fucking with swarm and service
discovery before deploying a large cluster of my service in two days on DCOS.

Docker is stuck in the 'one image on one machine' mindset. DCOS is taking over
at the higher levels of the stack. Mark my words.

[https://dcos.io/](https://dcos.io/)

~~~
maslam
@rjurney - we started with ECS right around when DCOS was coming out of alpha
(?). Anyway, it looks slick!

------
huslage
Environment variables are NEVER private. Please don't think that you can hide
information in there, as all of that information ends up in the process
table, which is public across the entire machine.

~~~
nnutter
How are they "public" across the entire machine?

~~~
phil21
I suppose it depends on your definition of public.

The environment of any process is accessible via /proc/<pid>/environ on a
Linux system. Other users cannot read these files, but in something like a
Docker container all processes likely share a user, and that can be a risk
(especially for a public webapp that may one day leak information or allow
remote command execution).
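
Easy to demonstrate from a shell running as the same user:

    # launch a process with a "secret" in its environment...
    SECRET_TOKEN=hunter2 sleep 300 &
    # ...then read it straight out of the process table
    tr '\0' '\n' < /proc/$!/environ | grep SECRET_TOKEN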

At least that's my immediate take on it.

------
maslam
HN, I'm a co-founder at Appuri. Happy to answer questions! PS: We LOVE most
AWS services like Amazon Redshift. Just not ECS ;)

~~~
dstroot
Did you deploy K8s on AWS? If so, can you add any details about how? Or are
you using K8s elsewhere? I love AWS, but I'm planning on spinning up on GCE
this weekend to play with K8s.

~~~
velkyk
I'm ops at Appuri. We deployed k8s on AWS since we use other services like
Redshift and RDS inside the VPC; we're also happy with how EC2 works, and of
course we have Reserved Instances, so we haven't looked into GCP yet. We're
running Kube on CentOS 7, and we bootstrap nodes using cloud-init (user-data)
to set up k8s, which we then use to run everything else. I would love to give
you more details; I might write a blog post about our Kube setup decisions
later. Kelsey wrote a nice manual for setting up k8s -
[https://github.com/kelseyhightower/kubernetes-the-hard-
way](https://github.com/kelseyhightower/kubernetes-the-hard-way) - which is
definitely on my to-read list this weekend :)

~~~
boulos
How I wish we had Postgres support in Cloud SQL already...

Depending on how many RIs you have, between our custom machine shapes, much
lower prices with per-minute billing and sustained-use discounts, and
BigQuery vs. Redshift, you might still come out ahead. You could always try
your hand at reselling your RIs on the "market".

Would you rather run pgsql by hand, or K8s? (And yes, I don't want to make you
choose; that's just the current state of the world.)

Disclosure: I work on Compute Engine, so of course we want your business.

~~~
velkyk
Heh, I would rather run Postgres as a service, which is what we have with RDS
:). We are aware of BigQuery's advantages, and Redshift resizing is a tedious
job, but we currently don't think the benefits of migrating to GCP would be
worth the effort. This can of course change in the future, but for now we are
reasonably happy with our setup on AWS and its reliability.

~~~
vgt
Perhaps you can take a look at some folks who've done this with great success:
[https://cloud.google.com/customers/sharethis/](https://cloud.google.com/customers/sharethis/)
[https://youtu.be/6Nv18xmJirs?t=8m7s](https://youtu.be/6Nv18xmJirs?t=8m7s)

------
x0rg
I hope the future of the cloud really will be managed OSS as a service.
Google is doing a great job with Kubernetes and GKE, and I hope the other
providers come to understand that. Microsoft is on the right track with DCOS
as a service; Amazon is just not there yet.

------
advisedwang
[off topic] Author, if you are reading this, be aware that when viewed in a
narrow browser window, the sharing icons overlap the text, even though 40% of
the screen is taken up by the right-hand sidebar/empty space.

~~~
maslam
Thanks @advisedwang. We're looking into it.

------
graffitici
Anybody have insights about using Docker Swarm? I imagine Kubernetes has been
battle-tested way more in production, especially by the likes of Google. But
from what I understand, Docker is really pushing Swarm. I'd be curious to hear
if others even considered Swarm before choosing K8s...

~~~
lobster_johnson
There's not really any comparison. Docker is clearly beefing up Docker/Swarm
to be more like Kubernetes, but in its current state, Swarm is just a
glorified Docker Compose.

For example, it does not handle services (K8s can automatically provision a
load balancer against all your containers), there's no volume handling, no
centralized logging, no label-based targeting, and it has very limited
scheduling (K8s uses cAdvisor to inform scheduling, can automatically ensure
that pods are spread across multiple AZs, etc.).

It'll be interesting to see what happens as Docker starts pushing into
Kubernetes' space. Given the multiple points of overlap/contention between K8s
and Docker (you have to disable Docker's built-in networking and iptables
management; Kubelet has to continually monitor Docker for orphaned containers
and volumes and so on; etc.) I wouldn't be surprised if Google one day decides
to eliminate the Docker daemon as a dependency entirely, by writing a bare-
bones container engine into Kubelet.

~~~
nakagi
Really? I also think Docker Swarm Mode is still behind K8s, but as far as I
read the docs, they support:

- load balancing between containers

- volume handling
([https://docs.docker.com/engine/tutorials/dockervolumes/](https://docs.docker.com/engine/tutorials/dockervolumes/))

- label-based constraints

I know some features are not as sophisticated as K8s's, and there is no AZ
awareness, but Swarm may try to catch up.
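
For reference, in 1.12+ Swarm Mode those look roughly like this (service names
arbitrary):

    # replicas are load-balanced by the built-in routing mesh
    docker service create --name web --replicas 3 -p 80:80 nginx

    # label-based placement constraint
    docker service create --name db \
      --constraint 'node.labels.disk == ssd' postgres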

~~~
lobster_johnson
I recommend looking into the Kubernetes architecture to understand how
different its design is.

A good example is volume management. With Kubernetes, you can tell a pod to
use an AWS EBS volume; when the pod needs the volume, Kubernetes will
automagically mount it and handle the state management for you.

If you define what's called a persistent volume, your pod can declare that it
needs, say, 1GB, and Kubernetes will automatically allocate 1GB from the
volume; you can have lots of pods working off this shared volume, and
Kubernetes will know which pods have "claimed" which parts of it.
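
A claim is only a few lines; a minimal sketch (names and sizes illustrative):

    kubectl create -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-claim
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    EOF

Pods then reference the claim by name in a persistentVolumeClaim volume and
never mention the underlying EBS volume at all.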

Another good example is config and secrets. In Kubernetes, you declaratively
create configuration objects ("configmaps") and secrets. If a pod needs, say,
access to an external API, you can store the keys in a secret and
declaratively give the pod access to the secret, which will be mounted into a
folder (or, alternatively, assigned to an environment variable, though that's
not as secure).
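
Roughly, with made-up names:

    # the secret is created once, in one place...
    kubectl create secret generic api-keys --from-literal=token=s3cr3t
    # ...and any pod that needs it just mounts it
    kubectl create -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: worker
    spec:
      containers:
      - name: worker
        image: example/worker:1.0
        volumeMounts:
        - name: api-keys
          mountPath: /etc/secrets   # the key appears as /etc/secrets/token
          readOnly: true
      volumes:
      - name: api-keys
        secret:
          secretName: api-keys
    EOF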

Yet another example is service management. You can tag a service (which is
another type of declaration that says "port X on some unique cluster IP
should be routed to every pod tagged with these labels") as load-balanced,
and if you're running in a cloud environment (AWS, GCE, etc.), K8s can
automatically create an external load balancer for you that exposes the
service publicly.
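
The whole declaration is tiny (names made up):

    kubectl create -f - <<'EOF'
    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      type: LoadBalancer    # on AWS/GCE this provisions an external LB
      selector:
        app: web            # routes to every pod carrying this label
      ports:
      - port: 80
        targetPort: 8080
    EOF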

Kubernetes is best described as a sophisticated state machine that takes
declarative objects ("manifests") that describe your world — i.e. which
containers should be running, which services should be exposed, etc. — and
then attempts to continuously reconcile reality with that declaration,
managing all sorts of state in the process.

Perhaps most important is the ability to abstract resources from pods. A pod
just declares the image to run and the resources — volumes, configs, secrets,
CPU/memory constraints, etc. — to make available to it. K8s's state machinery
takes care of the rest.

As far as I know, Docker Swarm has none of this, and you'd have to build these
things (e.g. REXRay for volumes) on top of Swarm yourself.

~~~
nakagi
Hmm, I just want to clarify whether you're talking about the standalone
Docker Swarm of Docker v1.11 and earlier
([https://docs.docker.com/swarm/](https://docs.docker.com/swarm/)) or the new
built-in Docker Swarm Mode from 1.12+
([https://docs.docker.com/engine/swarm/](https://docs.docker.com/engine/swarm/)).

The latter obviously borrows a lot of design and concepts from k8s, so I
thought the designs were no longer as different as they used to be. It just
doesn't have some (or, so far, a lot) of the cool features that k8s already
provides (it's still at the RC stage).

~~~
lobster_johnson
Ah, I don't know the new Swarm Mode at all. A cursory look does make it seem
like it's very much copying K8s.

------
siliconc0w
We evaluated ECS and Beanstalk but ended up writing a tool around building
CoreOS/Fleet clusters (not currently open source, but I'm trying).

We ran into similar complaints. CoreOS comes with etcd, which, though
initially unstable, is now solid and incredibly handy for service discovery
and configuration. We're using [https://github.com/MonsantoCo/etcd-aws-
cluster](https://github.com/MonsantoCo/etcd-aws-cluster) to configure it
dynamically, and we use etcd+confd to drive nginx containers for routing. All
in all it works well. Our biggest problems are Docker-bug-related, and those
we can generally handle by just terminating the node and letting autoscaling
heal the cluster.
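
The etcd+confd glue is conceptually tiny (key paths made up):

    # a service registers its address under a known etcd prefix...
    etcdctl set /services/web/backends/i-0abc123 10.0.1.12:8080
    # ...confd watches that prefix and re-renders/reloads the nginx config
    confd -backend etcd -node http://127.0.0.1:2379 -watch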

------
justicezyx
“No central config. ECS doesn't have a way to pass configuration to services
(i.e. Docker containers) other than with environment variables. Great, how do
I pass the same environment variable to every service?”

Would packaging the configuration together with the Docker image make more
sense? That enables more hermetic deployments.

~~~
velkyk
Do you mean hard-coding configs into the Docker image? I wouldn't support
this; IMO it's a worst-case-scenario setup :)

Imagine you need to change a single config value: you would need to update the
image, build, push, and redeploy, which can take some time depending on your
deployment.

With k8s you just do `kubectl edit configmaps <name>`, restart the pods that
use it, and you are done.

Also, there's no need to create per-stage images...

------
pbkhrv
We switched from ECS to Docker Cloud and never looked back.

------
SteveWatson
Article text is obscured by icons.

~~~
maslam
@SteveWatson - thanks for reporting, should be fixed now.

