You have to design for those failures. In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.
In the end, at this scale, even with four or five nines of reliability you'd still have to deal with 80 or 8 failures every day. So we would have to be resilient to those crashes anyway.
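Rough back-of-the-envelope behind those figures; the daily container count below is simply back-solved from the 80/8 failure numbers above, so treat it as an order of magnitude rather than an exact stat:

```python
# Expected failed containers per day = daily starts * (1 - reliability).
# 800_000 is back-solved from "80 failures/day at four nines", not a measured number.
daily_starts = 800_000

for label, reliability in [("four nines", 0.9999), ("five nines", 0.99999)]:
    expected_failures = daily_starts * (1 - reliability)
    print(f"{label}: ~{expected_failures:.0f} failed containers per day")

# four nines: ~80 failed containers per day
# five nines: ~8 failed containers per day
```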
However, it's a lot of wasted compute and performance that we'd love to get back. But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.
Now maybe another container technology is more reliable, but at this point our entire infrastructure works with Docker because, besides those warts, it gives us other advantages that make the overall thing worth it. So we stick with the devil we know ¯\_(ツ)_/¯.
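For the curious, the fan-out-and-tolerate-crashes pattern is roughly this shape. It's a heavily simplified sketch, not our actual code: the image name, test command, shard count and retry policy are all placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "ci-runner:latest"   # placeholder test-runner image
MAX_ATTEMPTS = 2             # give a shard one retry if its container dies

def run_shard(shard_id: int) -> bool:
    """Run one test shard in a throwaway container, retrying once on a crash."""
    for _ in range(MAX_ATTEMPTS):
        result = subprocess.run(
            ["docker", "run", "--rm", IMAGE, "run-tests", f"--shard={shard_id}"],
            capture_output=True,
        )
        if result.returncode == 0:
            return True
    return False

# Fan the build out over ~200 containers; a few crashes just mean retries,
# not a failed build.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(run_shard, range(200)))

print(f"{results.count(False)} shard(s) still failing after retries")
```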
> In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.
You spawn 200 containers for one build‽ Egad, we really are at the end of days.
> But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.
Since containers are just isolated processes, wouldn't running plain processes be just as fast (if not slightly faster), without requiring 200 containers for a single build?
The applications we test with this system have dependencies, both system packages and datastores. Containers allow us to isolate the test process along with all the dependent datastores (MySQL, Redis, ElasticSearch, etc.).
If we were to use regular processes, we'd have to both ensure the environment is properly set up before running the tests and fiddle with tons of port configurations so we could run 16 MySQLs and 16 Redises on the same host.
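Concretely, one common way this plays out with Docker is a throwaway network per build: every datastore keeps its default port inside its own container, the test container reaches them by container name, and nothing is published on the host, so 16 MySQLs coexist happily on one machine. A simplified sketch, not our exact setup (image names, env vars and the CI image are placeholders, and a real setup would wait for the datastores to be ready before running tests):

```python
import subprocess
import uuid

def sh(*args: str) -> None:
    subprocess.run(args, check=True)

# One private network per build; no ports are published to the host,
# so many MySQL/Redis containers can run side by side without conflicts.
build = f"build-{uuid.uuid4().hex[:8]}"
sh("docker", "network", "create", build)

# Placeholder images and credentials.
sh("docker", "run", "-d", "--rm", "--network", build, "--name", f"{build}-mysql",
   "-e", "MYSQL_ALLOW_EMPTY_PASSWORD=yes", "mysql:8.0")
sh("docker", "run", "-d", "--rm", "--network", build, "--name", f"{build}-redis",
   "redis:7")

# The test container resolves the datastores by container name on the shared network.
sh("docker", "run", "--rm", "--network", build,
   "-e", f"DATABASE_HOST={build}-mysql",
   "-e", f"REDIS_HOST={build}-redis",
   "ci-runner:latest", "run-tests")   # placeholder image and command

# Teardown (simplified): stop the datastores (--rm removes them), drop the network.
sh("docker", "stop", f"{build}-mysql", f"{build}-redis")
sh("docker", "network", "rm", build)
```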