You have to design for those failures. In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.
In the end, at this scale, even with four or five nines of reliability you'd still have to deal with 80 or 8 failures every day. So we would have to be resilient to those crashes anyway.
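Rough back-of-the-envelope behind those figures; the daily container count below is simply back-solved from the 80/8 failure numbers above, so treat it as an order of magnitude rather than an exact stat:

```python
# Expected failed containers per day = daily starts * (1 - reliability).
# 800_000 is back-solved from "80 failures/day at four nines", not a measured number.
daily_starts = 800_000

for label, reliability in [("four nines", 0.9999), ("five nines", 0.99999)]:
    expected_failures = daily_starts * (1 - reliability)
    print(f"{label}: ~{expected_failures:.0f} failed containers per day")

# four nines: ~80 failed containers per day
# five nines: ~8 failed containers per day
```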
However, it's a lot of wasted compute and performance that we'd love to get back. But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.
Now maybe another container technology is more reliable, but at this point our entire infrastructure works with Docker because, besides those warts, it gives us other advantages that make the overall thing worth it. So we stick with the devil we know ¯\_(ツ)_/¯.
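For the curious, the fan-out-and-tolerate-crashes pattern is roughly this shape. It's a heavily simplified sketch, not our actual code: the image name, test command, shard count and retry policy are all placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "ci-runner:latest"   # placeholder test-runner image
MAX_ATTEMPTS = 2             # give a shard one retry if its container dies

def run_shard(shard_id: int) -> bool:
    """Run one test shard in a throwaway container, retrying once on a crash."""
    for _ in range(MAX_ATTEMPTS):
        result = subprocess.run(
            ["docker", "run", "--rm", IMAGE, "run-tests", f"--shard={shard_id}"],
            capture_output=True,
        )
        if result.returncode == 0:
            return True
    return False

# Fan the build out over ~200 containers; a few crashes just mean retries,
# not a failed build.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(run_shard, range(200)))

print(f"{results.count(False)} shard(s) still failing after retries")
```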
> In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.
You spawn 200 containers for one build‽ Egad, we really are at the end of days.
> But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.
Since containers are just isolated processes, wouldn't running plain processes be just as fast (if not slightly faster), without requiring 200 containers for a single build?
The applications we test with this system have dependencies, both system packages and datastores. Containers allow us to isolate the test process along with all the dependent datastores (MySQL, Redis, ElasticSearch, etc.).
If we were to use regular processes, we'd have to both ensure the environment is properly set up before running the tests and fiddle with tons of port configurations so we could run 16 MySQLs and 16 Redises on the same host.
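Concretely, one common way this plays out with Docker is a throwaway network per build: every datastore keeps its default port inside its own container, the test container reaches them by container name, and nothing is published on the host, so 16 MySQLs coexist happily on one machine. A simplified sketch, not our exact setup (image names, env vars and the CI image are placeholders, and a real setup would wait for the datastores to be ready before running tests):

```python
import subprocess
import uuid

def sh(*args: str) -> None:
    subprocess.run(args, check=True)

# One private network per build; no ports are published to the host,
# so many MySQL/Redis containers can run side by side without conflicts.
build = f"build-{uuid.uuid4().hex[:8]}"
sh("docker", "network", "create", build)

# Placeholder images and credentials.
sh("docker", "run", "-d", "--rm", "--network", build, "--name", f"{build}-mysql",
   "-e", "MYSQL_ALLOW_EMPTY_PASSWORD=yes", "mysql:8.0")
sh("docker", "run", "-d", "--rm", "--network", build, "--name", f"{build}-redis",
   "redis:7")

# The test container resolves the datastores by container name on the shared network.
sh("docker", "run", "--rm", "--network", build,
   "-e", f"DATABASE_HOST={build}-mysql",
   "-e", f"REDIS_HOST={build}-redis",
   "ci-runner:latest", "run-tests")   # placeholder image and command

# Teardown (simplified): stop the datastores (--rm removes them), drop the network.
sh("docker", "stop", f"{build}-mysql", f"{build}-redis")
sh("docker", "network", "rm", build)
```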